Tried enabling -Dsolr.http1=true but it didn't help. Still seeing the timeout after 180s (even without sending any traffic to the cluster).
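For context, this is roughly how we pass the flag; a sketch assuming the standard SOLR_OPTS mechanism (the exact wiring in our Kubernetes manifests may differ):

    # solr.in.sh / container env: append the property so it reaches the JVM
    SOLR_OPTS="$SOLR_OPTS -Dsolr.http1=true"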
We also noticed the following errors on some of the nodes:

Caused by: java.util.concurrent.TimeoutException: Total timeout 600000 ms elapsed (stacktrace: https://justpaste.it/29bpv)

o.a.s.c.SolrCore java.lang.IllegalArgumentException: Unknown directory: MMapDirectory@/var/solr/data/my_collection_shard3_replica_t1643/data/snapshot_metadata (we do not use snapshots at all) (stacktrace: https://justpaste.it/88en6)

CoreIsClosedException from o.a.s.u.CommitTracker auto commit error...: https://justpaste.it/bbbms

org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException: Error from server at null (this node is a leader): https://justpaste.it/5nq7b

From time to time we also observe the following in the logs (TLOG replicas across the board), across multiple nodes:

WARN (indexFetcher-120-thread-1) [] o.a.s.h.IndexFetcher File _8ux.cfe did not match. expected checksum is 3843994300 and actual is checksum 2148229542. expected length is 542 and actual length is 542

> On 5. Dec 2022, at 5:12 PM, Houston Putman <hous...@apache.org> wrote:
>
> I'm not sure this is the issue, but maybe it's http2 vs http1.
>
> Could you retry with the following set on the cluster?
>
> -Dsolr.http1=true
>
> On Mon, Dec 5, 2022 at 5:08 AM Nick Vladiceanu <vladicean...@gmail.com> wrote:
>
>> Hello folks,
>>
>> We're running our SolrCloud cluster in Kubernetes. Recently we've upgraded from 8.11 to 9.0 (and eventually to 9.1).
>>
>> We fully reindexed the collections after the upgrade; all looked good, no errors, and we noticed response time improvements.
>>
>> We have the following specs:
>> collection size: 22M docs, 1.3Kb doc size; ~28Gb total collection size at this point;
>> shards: 6 shards, each ~4.7Gb; 1 core per node;
>> nodes: 96 nodes, 30Gi of RAM, 16 cores each;
>> heap: 23Gb;
>> JavaOpts: -Dsolr.modules=scripting,analysis-extras,ltr
>> gcTune: -XX:+UseG1GC -XX:G1HeapRegionSize=16m -XX:MaxGCPauseMillis=300 -XX:InitiatingHeapOccupancyPercent=75 -XX:+UseLargePages -XX:+ParallelRefProcEnabled -XX:ParallelGCThreads=10 -XX:ConcGCThreads=2 -XX:MinHeapFreeRatio=2 -XX:MaxHeapFreeRatio=10
>>
>>
>> Problem
>>
>> The problem appears when we try to reload the collection: in sync mode the reload times out, and in async mode it becomes a forever-running task:
>>
>> curl "reload" output: https://justpaste.it/ap4d2
>> ErrorReportingConcurrentUpdateSolrClient stacktrace (appears in the logs of some nodes): https://justpaste.it/aq3dw
>>
>> There are no issues on a newly created cluster if there is no incoming traffic to it. Once we start sending requests to the cluster, collection reload becomes impossible. Other (smaller) collections within the same cluster reload just fine.
>>
>> In some cases, on some nodes, the Old generation GC kicks in and makes the entire cluster unstable; however, that does not happen every time a collection reload times out.
>>
>> We've tried rolling back to 8.11 and everything works normally as it used to: no errors with reload, no other errors in the logs during reload, etc.
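For reference, the reload calls we issue look like this (host and collection name are placeholders):

    # v1 Collections API, sync: blocks until every replica has reloaded
    curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=my_collection"

    # v1 Collections API, async: returns immediately with a request id to poll
    curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=my_collection&async=reload-1"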
>>
>> We tried the following:
>> run 9.0 and 9.1 on Java 11 and Java 17: same result;
>> lower cache warming, disable firstSearcher queries: same result;
>> increase heap size, tune GC: same result;
>> use apiv1 and apiv2 to issue reload commands: no difference;
>> sync vs async reload: either a forever-running task or a timeout after 180 seconds.
>>
>> Did anyone face similar issues after upgrading to version 9 of Solr? Could you please advise where we should focus our attention while debugging this behavior? Any other advice/suggestions?
>>
>> Thank you
>>
>>
>> Best regards,
>> Nick Vladiceanu
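For completeness, the v2-style reload we tried and the status poll for the async task look roughly like this (the v2 path is from memory for 9.x, placeholders as above):

    # v2 API reload
    curl -X POST "http://localhost:8983/api/collections/my_collection" \
      -H "Content-Type: application/json" \
      -d '{"reload": {}}'

    # poll the async request id from the earlier async call
    curl "http://localhost:8983/solr/admin/collections?action=REQUESTSTATUS&requestid=reload-1"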