Unfortunately, we couldn’t find the root cause of this behaviour in Solr 9 and were thus forced to roll back to 8.11.
Is anyone else facing issues similar to those mentioned in this thread? Any ideas on how we should proceed in such a case?

Thanks
---
Nick Vladiceanu
vladicean...@gmail.com

> On 9. Dec 2022, at 16:04, Nick Vladiceanu <vladicean...@gmail.com> wrote:
>
> Tried enabling -Dsolr.http1=true, but it didn’t help. Seeing a timeout
> after 180s (even without sending any traffic to the cluster) and also noticed
>
> Caused by: java.util.concurrent.TimeoutException: Total timeout 600000
> ms elapsed (stacktrace here https://justpaste.it/29bpv)
>
> on some of the nodes.
>
> Also spotting errors related to:
> o.a.s.c.SolrCore java.lang.IllegalArgumentException: Unknown directory:
> MMapDirectory@/var/solr/data/my_collection_shard3_replica_t1643/data/snapshot_metadata
> (we do not use snapshots at all) (stacktrace https://justpaste.it/88en6 )
> CoreIsClosedException o.a.s.u.CommitTracker auto commit error...:
> https://justpaste.it/bbbms
> org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException:
> Error from server at null https://justpaste.it/5nq7b (this node is a leader)
>
> From time to time, observing the following in the logs (TLOG replicas across
> the board) across multiple nodes:
> WARN (indexFetcher-120-thread-1) [] o.a.s.h.IndexFetcher File _8ux.cfe did
> not match. expected checksum is 3843994300 and actual is checksum 2148229542.
> expected length is 542 and actual length is 542
>
>> On 5. Dec 2022, at 5:12 PM, Houston Putman <hous...@apache.org> wrote:
>>
>> I'm not sure this is the issue, but maybe it's http2 vs http1.
>>
>> Could you retry with the following set on the cluster?
>>
>> -Dsolr.http1=true
>>
>> On Mon, Dec 5, 2022 at 5:08 AM Nick Vladiceanu <vladicean...@gmail.com>
>> wrote:
>>
>>> Hello folks,
>>>
>>> We’re running our SolrCloud cluster in Kubernetes. Recently we’ve upgraded
>>> from 8.11 to 9.0 (and eventually to 9.1).
>>>
>>> We fully reindexed the collections after the upgrade; all looking good, no
>>> errors, and response time improvements were noticed.
>>>
>>> We have the following specs:
>>> collection size:
>>>   22M docs, 1.3 KB doc size; ~28 GB total collection size at this point;
>>> shards: 6 shards, each ~4.7 GB; 1 core per node;
>>> nodes:
>>>   30Gi of RAM,
>>>   16 cores
>>>   96 nodes
>>> Heap: 23 GB heap
>>> JavaOpts: -Dsolr.modules=scripting,analysis-extras,ltr
>>> gcTune: -XX:+UseG1GC -XX:G1HeapRegionSize=16m -XX:MaxGCPauseMillis=300
>>> -XX:InitiatingHeapOccupancyPercent=75 -XX:+UseLargePages
>>> -XX:+ParallelRefProcEnabled -XX:ParallelGCThreads=10 -XX:ConcGCThreads=2
>>> -XX:MinHeapFreeRatio=2 -XX:MaxHeapFreeRatio=10
>>>
>>>
>>> Problem
>>>
>>> The problem we face is when we try to reload the collection: in sync mode
>>> the request times out, and in async mode the task runs forever:
>>>
>>> curl “reload” output: https://justpaste.it/ap4d2
>>> ErrorReportingConcurrentUpdateSolrClient stacktrace (appears in the logs
>>> of some nodes): https://justpaste.it/aq3dw
>>>
>>> There are no issues on a newly created cluster if there is no incoming
>>> traffic to it. Once we start sending requests to the cluster, collection
>>> reload becomes impossible. Other (smaller) collections within the same
>>> cluster reload just fine.
>>>
>>> In some cases, on some nodes, the old-generation GC kicks in and makes
>>> the entire cluster unstable; however, that doesn’t happen every time a
>>> collection reload times out.
>>>
>>> We’ve tried rolling back to 8.11, and everything works normally as it used
>>> to: no errors with reload, no other errors in the logs during reload,
>>> etc.
>>>
>>> We tried the following:
>>> run 9.0, 9.1 on Java 11 and Java 17: same result;
>>> lower cache warming, disable firstSearcher queries: same result;
>>> increase heap size, tune gc: same result;
>>> use apiv1 and apiv2 to issue reload commands: no difference;
>>> sync vs async reload: either forever running task or timing out after 180
>>> seconds;
>>>
>>> Did anyone face similar issues after upgrading to version 9 of Solr? Could
>>> you please advise where we should focus our attention while debugging this
>>> behavior? Any other advice or suggestions?
>>>
>>> Thank you
>>>
>>>
>>> Best regards,
>>> Nick Vladiceanu
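
For anyone trying to reproduce the async case described in the quoted message, here is a minimal sketch of how an async reload could be submitted and polled through the v1 Collections API (action=RELOAD with an async id, then action=REQUESTSTATUS). It is not the script used in this thread. The base URL, the request id and the function names are assumptions made for illustration; the 180 second default deadline simply mirrors the timeout reported above.

```python
import json
import time
import urllib.parse
import urllib.request


def collections_api_url(base_url, params):
    """Build a v1 Collections API URL, e.g. .../admin/collections?action=RELOAD&..."""
    query = urllib.parse.urlencode({**params, "wt": "json"})
    return f"{base_url.rstrip('/')}/admin/collections?{query}"


def http_get_json(url, timeout=30):
    """Fetch a URL and decode the JSON response body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def reload_async(base_url, collection, request_id, fetch=http_get_json,
                 poll_interval=5.0, deadline=180.0, sleep=time.sleep):
    """Submit an async RELOAD, then poll REQUESTSTATUS.

    Returns the final state reported by Solr, or a "timeout ..." string if the
    task is still pending once the (hypothetical) deadline has passed.
    """
    # Submit the reload; with async=<id> Solr returns immediately.
    fetch(collections_api_url(
        base_url, {"action": "RELOAD", "name": collection, "async": request_id}))

    status_url = collections_api_url(
        base_url, {"action": "REQUESTSTATUS", "requestid": request_id})
    waited = 0.0
    state = "unknown"
    while waited <= deadline:
        state = fetch(status_url).get("status", {}).get("state", "unknown")
        if state in ("completed", "failed", "notfound"):
            return state
        sleep(poll_interval)
        waited += poll_interval
    return f"timeout (last state: {state})"
```

Calling reload_async("http://localhost:8983/solr", "my_collection", "reload-1") against a test node (the host and request id are placeholders) should make it easy to tell a task that eventually fails from one that stays in the running state indefinitely. The fetch and sleep parameters are injectable only so the polling logic can be exercised without a live cluster.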