Hi,

We have a SolrCloud setup in Kubernetes with 2 Solr instances and 3
ZooKeeper instances, with 1 shard. Each Solr and ZooKeeper instance is
configured with 8G of persistent storage. The memory allocated to Solr is
16G, with a 10G heap size. There are a maximum of 2.5 million records
indexed. A scheduler client calls Solr with the URL
/update/json?wt=json&commit=true to perform add/update/delete operations.
Occasionally a huge update/delete happens involving 1 million records; this
calls the same API (/update/json?wt=json&commit=true) with 500 documents at
a time, from multiple threads. Everything worked fine for about 1 week, but
then we suddenly saw errors in solr.log that put Solr into an error state,
and I had to restart one of the Solr nodes.
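For reference, the bulk-update path in our scheduler client works roughly like the sketch below (a minimal illustration only; the endpoint constant, collection name, and helper names are hypothetical, not our actual code):

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Hypothetical endpoint; we POST with commit=true on every request.
SOLR_URL = "http://solr-0.solrcluster:8983/solr/datacore/update/json?wt=json&commit=true"
BATCH_SIZE = 500

def chunk(docs, size=BATCH_SIZE):
    """Split the full document list into batches of `size` documents."""
    return [docs[i:i + size] for i in range(0, len(docs), size)]

def send_batch(batch):
    """POST one batch of docs as a JSON array to the update handler."""
    req = urllib.request.Request(
        SOLR_URL,
        data=json.dumps(batch).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

def bulk_index(docs, threads=8):
    """Index all docs, 500 at a time, across multiple worker threads."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(send_batch, chunk(docs)))
```

So a 1-million-record run issues about 2,000 such requests concurrently, each carrying its own commit.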

*The errors noticed are:*

*Node 1:*

2021-04-09 08:20:56.657 ERROR
(updateExecutor-5-thread-169-processing-x:datacore_shard1_replica_n1
r:core_node3 null n:solr-1.solrcluster:8983_solr c:datacore s:shard1)
[c:datacore s:shard1 r:core_node3 x:datacore_shard1_replica_n1]
o.a.s.u.ErrorReportingConcurrentUpdateSolrClient Error when calling
SolrCmdDistributor$Req:
cmd=add{,id=S-170262-P-108028200-F-800001737-E-180905508}; node=ForwardNode:
 http://solr-0.solrcluster:8983/solr/datacore_shard1_replica_n2/ to
http://solr-0.solrcluster:8983/solr/datacore_shard1_replica_n2/ =>
java.io.IOException: java.io.IOException: cancel_stream_error at
org.eclipse.jetty.client.util.DeferredContentProvider.flush(DeferredContentProvider.java:193)
java.io.IOException: java.io.IOException: cancel_stream_error at
org.eclipse.jetty.client.util.DeferredContentProvider.flush(DeferredContentProvider.java:193)
~[?:?]

*Node2:*

2021-04-09 08:22:56.661 INFO (qtp1632497828-35124) [c:datacore s:shard1
r:core_node4 x:datacore_shard1_replica_n2]
o.a.s.u.p.LogUpdateProcessorFactory [datacore_shard1_replica_n2]
webapp=/solr path=/update params={update.distrib=TOLEADER&distrib.from=
http://solr-1.solrcluster:8983/solr/datacore_shard1_replica_n1/&wt=javabin&version=2}{}
0 119999 2021-04-09 08:22:56.661 ERROR (qtp1632497828-35124) [c:datacore
s:shard1 r:core_node4 x:datacore_shard1_replica_n2]
o.a.s.h.RequestHandlerBase java.io.IOException:
java.util.concurrent.TimeoutException: Idle timeout expired: 120000/120000
ms at
org.eclipse.jetty.server.HttpInput$ErrorState.noContent(HttpInput.java:1085)
at org.eclipse.jetty.server.HttpInput.read(HttpInput.java:318)

*And on both nodes we can see the below error as well -*

2021-04-09 08:21:00.812 INFO (qtp1632497828-35036) [c:datacore s:shard1
r:core_node4 x:datacore_shard1_replica_n2]
o.a.s.u.p.LogUpdateProcessorFactory [datacore_shard1_replica_n2]
webapp=/solr path=/update params={update.distrib=TOLEADER&distrib.from=
http://solr-1.solrcluster:8983/solr/datacore_shard1_replica_n1/&wt=javabin&version=2}{}
0 120770 2021-04-09 08:21:00.812 ERROR (qtp1632497828-35036) [c:datacore
s:shard1 r:core_node4 x:datacore_shard1_replica_n2]
o.a.s.h.RequestHandlerBase java.io.IOException: Task queue processing has
stalled for 90013 ms with 0 remaining elements to process. at
org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient.blockUntilFinished(ConcurrentUpdateHttp2SolrClient.java:501)

The stall time is set to 90000 ms.

Why are we getting these errors?

Why is it stalling for so long?

The average doc size is 1 KB. How can we resolve this problem?


Kindly help us; it is urgent for us to resolve these issues.


Thanks,

Rekha
