Unfortunately, we couldn’t find the root cause of this behaviour in Solr 9 and 
were thus forced to roll back to 8.11.

Is anyone else facing issues similar to those mentioned in this thread? Any ideas 
on how we should proceed in such a case?
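
For anyone who wants to try reproducing this, here is a minimal sketch of the async reload plus status polling described further down in the thread. It assumes the v1 Collections API (action=RELOAD with an async id, then action=REQUESTSTATUS) and a JSON response carrying status.state. The base URL, request id, and helper names are hypothetical, not taken from our setup.

```python
import json
import time
import urllib.parse
import urllib.request

# Terminal states assumed for REQUESTSTATUS; anything else means "keep polling".
TERMINAL_STATES = ("completed", "failed", "notfound")


def collections_api_url(base_url, params):
    """Build a v1 Collections API URL from a dict of query parameters."""
    return f"{base_url.rstrip('/')}/admin/collections?{urllib.parse.urlencode(params)}"


def fetch_json(url, timeout=30):
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def reload_and_wait(base_url, collection, request_id, fetch=fetch_json,
                    poll_seconds=5, max_wait_seconds=600, sleep=time.sleep):
    """Submit an async RELOAD, then poll REQUESTSTATUS until it finishes.

    Returns the final state string, or "timed_out" if the task is still
    running after max_wait_seconds (the "forever running task" case).
    """
    fetch(collections_api_url(base_url, {
        "action": "RELOAD", "name": collection,
        "async": request_id, "wt": "json",
    }))
    status_url = collections_api_url(base_url, {
        "action": "REQUESTSTATUS", "requestid": request_id, "wt": "json",
    })
    waited = 0
    while waited <= max_wait_seconds:
        state = fetch(status_url).get("status", {}).get("state", "unknown")
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
        waited += poll_seconds
    return "timed_out"


# Example call (hypothetical host and request id):
# reload_and_wait("http://localhost:8983/solr", "my_collection", "reload-001")
```

Bounding the wait on the client side makes the stuck-reload case show up as an explicit "timed_out" result instead of a task that has to be watched by hand.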

Thanks


---
Nick Vladiceanu
vladicean...@gmail.com 




> On 9. Dec 2022, at 16:04, Nick Vladiceanu <vladicean...@gmail.com> wrote:
> 
> Tried enabling -Dsolr.http1=true, but it didn’t help. Still seeing a timeout 
> after 180s (even without sending any traffic to the cluster), and also noticed 
> 
>       Caused by: java.util.concurrent.TimeoutException: Total timeout 600000 
> ms elapsed (stacktrace here https://justpaste.it/29bpv)
> 
> on some of the nodes. 
> 
> 
> Also, spotting errors related to:
> o.a.s.c.SolrCore java.lang.IllegalArgumentException: Unknown directory: 
> MMapDirectory@/var/solr/data/my_collection_shard3_replica_t1643/data/snapshot_metadata
>  (we do not use snapshots at all) (stacktrace https://justpaste.it/88en6 )
> CoreIsClosedException o.a.s.u.CommitTracker auto commit error...: 
> https://justpaste.it/bbbms 
> org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException: 
> Error from server at null  https://justpaste.it/5nq7b (this node is a leader)
> 
> From time to time observing in the logs (TLOG replicas across the board) 
> across multiple nodes:
> WARN  (indexFetcher-120-thread-1) [] o.a.s.h.IndexFetcher File _8ux.cfe did 
> not match. expected checksum is 3843994300 and actual is checksum 2148229542. 
> expected length is 542 and actual length is 542
> 
> 
> 
>> On 5. Dec 2022, at 5:12 PM, Houston Putman <hous...@apache.org> wrote:
>> 
>> I'm not sure this is the issue, but maybe it's http2 vs http1.
>> 
>> Could you retry with the following set on the cluster?
>> 
>> -Dsolr.http1=true
>> 
>> 
>> 
>> On Mon, Dec 5, 2022 at 5:08 AM Nick Vladiceanu <vladicean...@gmail.com>
>> wrote:
>> 
>>> Hello folks,
>>> 
>>> We’re running our SolrCloud cluster in Kubernetes. We recently upgraded
>>> from 8.11 to 9.0 (and eventually to 9.1).
>>> 
>>> We fully reindexed the collections after the upgrade; everything looked
>>> good, with no errors and noticeably improved response times.
>>> 
>>> We have the following specs:
>>> collection size:
>>> 22M docs, 1.3 KB doc size; ~28 GB total collection size at this point;
>>> shards: 6 shards, each ~4.7 GB; 1 core per node;
>>> nodes:
>>> 30Gi of RAM,
>>> 16 cores
>>> 96 nodes
>>> Heap: 23Gb heap
>>> JavaOpts: -Dsolr.modules=scripting,analysis-extras,ltr
>>> gcTune: -XX:+UseG1GC -XX:G1HeapRegionSize=16m -XX:MaxGCPauseMillis=300
>>> -XX:InitiatingHeapOccupancyPercent=75 -XX:+UseLargePages
>>> -XX:+ParallelRefProcEnabled -XX:ParallelGCThreads=10 -XX:ConcGCThreads=2
>>> -XX:MinHeapFreeRatio=2 -XX:MaxHeapFreeRatio=10
>>> 
>>> 
>>> Problem
>>> 
>>> The problem is that reloading the collection fails: in sync mode the
>>> request times out, and in async mode the task runs forever:
>>> 
>>> curl “reload” output: https://justpaste.it/ap4d2
>>> ErrorReportingConcurrentUpdateSolrClient stacktrace (appears in the logs
>>> of some nodes): https://justpaste.it/aq3dw
>>> 
>>> There are no issues on a newly created cluster if there is no incoming
>>> traffic to it. Once we start sending requests to the cluster, collection
>>> reload becomes impossible. Other collections (smaller) within the same
>>> cluster are reloading just fine.
>>> 
>>> In some cases, old-generation GC kicks in on some nodes and makes
>>> the entire cluster unstable; however, that doesn’t happen every time a
>>> collection reload times out.
>>> 
>>> We rolled back to 8.11 and everything works as it used to: no errors
>>> with reload, no other errors in the logs during reload, etc.
>>> 
>>> We tried the following:
>>> run 9.0, 9.1 on Java 11 and Java 17: same result;
>>> lower cache warming, disable firstSearcher queries: same result;
>>> increase heap size, tune gc: same result;
>>> use apiv1 and apiv2 to issue reload commands: no difference;
>>> sync vs async reload: either forever running task or timing out after 180
>>> seconds;
>>> 
>>> Has anyone faced similar issues after upgrading to Solr 9? Could you
>>> please advise where we should focus our attention while debugging this
>>> behavior? Any other advice or suggestions?
>>> 
>>> Thank you
>>> 
>>> 
>>> Best regards,
>>> Nick Vladiceanu
> 
