This sounds a lot like this issue: 
https://issues.apache.org/jira/browse/SOLR-18087

Especially this part of the log message:
> (downloaded 320864256 of 511797594 bytes)

IME, if the download just stops midway, it's usually because
of a deadlocked HTTP/2 connection from follower to leader.

The workaround is to run with -Dsolr.http1=true. The 
underlying issue is still unresolved AFAIK (I remember
still seeing this when running from main fairly recently).
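
A minimal sketch of applying that workaround, assuming a Linux install
where extra JVM options are passed through SOLR_OPTS in solr.in.sh (the
location of that file varies by install; on Windows the equivalent file
is solr.in.cmd). Only the -Dsolr.http1=true flag comes from the advice
above; applying it on every node is an assumption:

```shell
# solr.in.sh (assumed location: wherever your install keeps its include file)
# Append the HTTP/1.1 workaround flag while keeping any options already set.
SOLR_OPTS="${SOLR_OPTS:-} -Dsolr.http1=true"
```

Each node would need a restart for the system property to take effect.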


From: [email protected] At: 06/02/26 00:20:59 UTC-4:00
To: [email protected]
Cc: [email protected]
Subject: Solr replica recovery failures

Hello,
We have a SolrCloud setup in our Teamcenter PROD environment, as follows:
Solr + ZooKeeper : server1 OSLO (Leader for all replicas)
Solr + ZooKeeper : server2 HOU (Follower)
Solr + ZooKeeper : server3 SNG (Follower)

Solr version : 9.7.0
No of shards: 10 shards for collection1
No of replicas: 3 per shard
Java: JDK21
Teamcenter : 2412.0006
From time to time, one of the follower servers (HOU or SNG) suddenly lags 
behind, and all of its replicas get stuck in recovery mode indefinitely. 
According to the Solr logs, during replica recovery the follower tries to copy 
segment files from the leader, and the copy fails with a 2-minute timeout for 
files larger than ~150 MB.
The solr.log shows errors such as:
"WARN (recoveryExecutor-12-thread-9-processing-hou4140.verit.dnv.com:8984_solr collection1_shard4_replica_n38 collection1 shard4 core_node41) [c:collection1 s:shard4 r:core_node41 x:collection1_shard4_replica_n38 t:] o.a.s.h.IndexFetcher Error in fetching file: _1jnu_Lucene99_0.pos (downloaded 320864256 of 511797594 bytes) => java.io.IOException: java.util.concurrent.TimeoutException: Total timeout 120000 ms elapsed
    at org.eclipse.jetty.client.util.InputStreamResponseListener$Input.toIOException(InputStreamResponseListener.java:343)
java.io.IOException: java.util.concurrent.TimeoutException: Total timeout 120000 ms elapsed"

Basically, copying large files (150+ MB) from leader to follower fails after 
120000 ms.
Any thoughts on the questions below would be appreciated:

  1. Why do the follower servers suddenly lag behind?
  2. Is there any way to increase the 120-second timeout?

We have already tried adding parameters such as "-Dsolr.http.timeout" and 
"-Dsolr.indexfetcher.timeout", but none of them helped.
Thank you,
Prafull


Best Regards,

Prafull Prakash Patil

PLM Operations, Pune

Mobile : +917032529590

