Tried enabling -Dsolr.http1=true, but it didn’t help. Still seeing the timeout 
after 180s (even without sending any traffic to the cluster), and also noticed 

        Caused by: java.util.concurrent.TimeoutException: Total timeout 600000 
ms elapsed (stacktrace here: https://justpaste.it/29bpv)

on some of the nodes. 
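For reference, we set the flag the usual way via SOLR_OPTS (the file location below is distro-dependent and only illustrative):

```shell
# Hypothetical excerpt from solr.in.sh (location varies, e.g. /etc/default/solr.in.sh).
# Forces Solr's internal HTTP clients back to HTTP/1.1:
SOLR_OPTS="$SOLR_OPTS -Dsolr.http1=true"
```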


Also spotting errors related to:

o.a.s.c.SolrCore java.lang.IllegalArgumentException: Unknown directory: 
MMapDirectory@/var/solr/data/my_collection_shard3_replica_t1643/data/snapshot_metadata
(we do not use snapshots at all) (stacktrace: https://justpaste.it/88en6)

CoreIsClosedException in o.a.s.u.CommitTracker, "auto commit error...": 
https://justpaste.it/bbbms

org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException: Error 
from server at null: https://justpaste.it/5nq7b 
(this node is a leader)

From time to time we also observe the following warning across multiple nodes 
(TLOG replicas across the board):

WARN  (indexFetcher-120-thread-1) [] o.a.s.h.IndexFetcher File _8ux.cfe did not 
match. expected checksum is 3843994300 and actual is checksum 2148229542. 
expected length is 542 and actual length is 542
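As a side note, one way to sanity-check the segment files on a replica that logs these mismatches is Lucene's CheckIndex tool (the jar and index paths below are illustrative and vary by install; run it only against a stopped node or a copy of the index directory):

```shell
# Run Lucene's CheckIndex against a (stopped) replica's index directory.
# Jar location and index path are illustrative, adjust for your install:
java -cp server/solr-webapp/webapp/WEB-INF/lib/lucene-core-9.1.0.jar \
  org.apache.lucene.index.CheckIndex \
  /var/solr/data/my_collection_shard3_replica_t1643/data/index
```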

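For completeness, the reload calls being issued look roughly like this (host and collection name are illustrative):

```shell
# v1 Collections API, synchronous reload (the call that times out after 180s):
curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=my_collection"

# Async variant: returns a request id immediately, then poll its status
# (in our case the task stays in a running state forever):
curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=my_collection&async=reload-req-1"
curl "http://localhost:8983/solr/admin/collections?action=REQUESTSTATUS&requestid=reload-req-1"

# v2 API equivalent (Solr 9):
curl -X POST http://localhost:8983/api/collections/my_collection \
  -H 'Content-Type: application/json' \
  -d '{"reload": {}}'
```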


> On 5. Dec 2022, at 5:12 PM, Houston Putman <hous...@apache.org> wrote:
> 
> I'm not sure this is the issue, but maybe it's http2 vs http1.
> 
> Could you retry with the following set on the cluster?
> 
> -Dsolr.http1=true
> 
> 
> 
> On Mon, Dec 5, 2022 at 5:08 AM Nick Vladiceanu <vladicean...@gmail.com>
> wrote:
> 
>> Hello folks,
>> 
>> We’re running our SolrCloud cluster in Kubernetes. Recently we’ve upgraded
>> from 8.11 to 9.0 (and eventually to 9.1).
>> 
>> Fully reindexed collections after upgrade, all looking good, no errors,
>> response time improvements are noticed.
>> 
>> We have the following specs:
>> collection size:
>> 22M docs, 1.3 KB doc size; ~28 GB total collection size at this point;
>> shards: 6 shards, each ~4.7 GB; 1 core per node;
>> nodes: 96 nodes, each with:
>> 30 GiB of RAM,
>> 16 cores
>> heap: 23 GB
>> JavaOpts: -Dsolr.modules=scripting,analysis-extras,ltr
>> gcTune: -XX:+UseG1GC -XX:G1HeapRegionSize=16m -XX:MaxGCPauseMillis=300
>> -XX:InitiatingHeapOccupancyPercent=75 -XX:+UseLargePages
>> -XX:+ParallelRefProcEnabled -XX:ParallelGCThreads=10 -XX:ConcGCThreads=2
>> -XX:MinHeapFreeRatio=2 -XX:MaxHeapFreeRatio=10
>> 
>> 
>> Problem
>> 
>> The problem we face is when we try to reload the collection: in sync mode
>> we get timed out, and in async mode the task runs forever:
>> 
>> curl “reload” output: https://justpaste.it/ap4d2
>> ErrorReportingConcurrentUpdateSolrClient stacktrace (appears in the logs
>> of some nodes): https://justpaste.it/aq3dw
>> 
>> There are no issues on a newly created cluster if there is no incoming
>> traffic to it. Once we start sending requests to the cluster, collection
>> reload becomes impossible. Other collections (smaller) within the same
>> cluster are reloading just fine.
>> 
>> In some cases, on some nodes, old-generation GC kicks in and makes the
>> entire cluster unstable; however, that doesn’t happen every time a
>> collection reload times out.
>> 
>> We’ve tried the rollback to 8.11 and everything works normally as it used
>> to be, no errors with reload, no other errors in the logs during reload,
>> etc.
>> 
>> We tried the following:
>> ran 9.0 and 9.1 on Java 11 and Java 17: same result;
>> lowered cache warming, disabled firstSearcher queries: same result;
>> increased heap size, tuned GC: same result;
>> used apiv1 and apiv2 to issue reload commands: no difference;
>> sync vs async reload: either a forever-running task or a timeout after 180
>> seconds;
>> 
>> Did anyone face similar issues after upgrading to version 9 of Solr? Could
>> you please advise where we should focus our attention while debugging this
>> behavior? Any other advice/suggestions?
>> 
>> Thank you
>> 
>> 
>> Best regards,
>> Nick Vladiceanu
