Re: Replicas in Recovery During Atomic Updates
Hi,

Does anyone have any idea about this issue? Apart from the errors in the previous email, we are frequently facing the errors below:

2020-08-19 11:56:09.467 ERROR (qtp1546693040-32) [c:collection_4 s:shard3 r:core_node13 x:collection_4_shard3_replica_n10] o.a.s.u.SolrCmdDistributor java.io.IOException: Request processing has stalled for 20017ms with 100 remaining elements in the queue.

2020-08-19 11:56:16.243 ERROR (qtp1546693040-72) [c:collection_4 s:shard3 r:core_node13 x:collection_4_shard3_replica_n10] o.a.s.h.RequestHandlerBase java.io.IOException: Task queue processing has stalled for 20216 ms with 0 remaining elements to process.

2020-08-19 11:56:22.584 ERROR (qtp1546693040-32) [c:collection_4 s:shard3 r:core_node13 x:collection_4_shard3_replica_n10] o.a.s.u.p.DistributedZkUpdateProcessor Setting up to try to start recovery on replica core_node11 with url http://x.x.x.25:8983/solr/collection_4_shard3_replica_n8/ by increasing leader term => java.io.IOException: Request processing has stalled for 20017ms with 100 remaining elements in the queue.

2020-08-19 11:56:16.064 ERROR (updateExecutor-5-thread-8-processing-null) [ ] o.a.s.u.ErrorReportingConcurrentUpdateSolrClient Error when calling SolrCmdDistributor$Req: cmd=delete{_version_=-1675454745405292544,query=`{!cache=false}_expire_at_:[* TO 2020-08-19T11:55:47.604Z]`,commitWithin=-1}; node=ForwardNode: http://x.x.x.24:8983/solr/collection_4_shard3_replica_n10/ to http://x.x.x.24:8983/solr/collection_4_shard3_replica_n10/ => org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://x.x.x.24:8983/solr/collection_4_shard3_replica_n10/: null

On Tue, Aug 11, 2020 at 2:08 AM Anshuman Singh wrote:

> Just to give you an idea, this is how we are ingesting:
>
> {"id": 1, "field1": {"inc": 20}, "field2": {"inc": 30}, "field3": 40,
> "field4": "some string"}
>
> We are using Solr-8.5.1. We have not configured any update processor.
> Hard commit happens every minute or at 100k docs, soft commit happens
> every 10 mins.
> We have an external ZK setup with 5 nodes.
>
> Open files hard/soft limit is 65k and "max user processes" is unlimited.
>
> These are the different ERROR logs I found in the log files:
>
> ERROR (qtp1546693040-2637) [c:collection s:shard27 r:core_node109 x:collection_shard27_replica_n106] o.a.s.s.HttpSolrCall null:org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException: Async exception during distributed update: java.net.ConnectException: Connection refused
>
> ERROR (qtp1546693040-1136) [c:collection s:shard101 r:core_node405 x:collection_shard101_replica_n402] o.a.s.s.HttpSolrCall null:java.io.IOException: java.lang.InterruptedException
>
> ERROR (qtp1546693040-2704) [c:collection s:shard101 r:core_node405 x:collection_shard101_replica_n402] o.a.s.s.HttpSolrCall null:org.eclipse.jetty.io.EofException: Reset cancel_stream_error
>
> ERROR (qtp1546693040-1344) [c:collection s:shard20 r:core_node79 x:collection_shard20_replica_n76] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: No registered leader was found after waiting for 4000ms, collection: collection slice: shard48 saw state=DocCollection(collection//collections/collection/state.json/96434)={
>
> ERROR (qtp1546693040-2928) [c:collection s:shard80 r:core_node319 x:collection_shard80_replica_n316] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Request says it is coming from leader, but we are the leader
>
> ERROR (updateExecutor-5-thread-47-processing-n:192.100.20.19:8985_solr x:collection_shard161_replica_n641 c:collection s:shard161 r:core_node646) [c:collection s:shard161 r:core_node646 x:collection_shard161_replica_n641] o.a.s.u.SolrCmdDistributor org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException: Error from server at null: Expected mime type application/octet-stream but got application/json
>
> ERROR (recoveryExecutor-7-thread-16-processing-n:192.100.20.33:8984_solr x:collection_shard80_replica_n47 c:collection s:shard80 r:core_node48) [c:collection s:shard80 r:core_node48 x:collection_shard80_replica_n47] o.a.s.c.RecoveryStrategy Error while trying to recover. core=collection_shard80_replica_n47:java.util.concurrent.ExecutionException: org.apache.solr.client.solrj.SolrServerException: IOException occurred when talking to server at: http://192.100.20.34:8984/solr
>
> ERROR (zkCallback-10-thread-22) [c:collection s:shard19 r:core_node322 x:collection_shard19_replica_n321] o.a.s.c.ShardLeaderElectionContext There was a problem trying to register as the leader:org.apache.solr.common.AlreadyClosedException
>
> ERROR (OverseerStateUpdate-176461820351853980-192.100.20.34:8985_solr-n_002357) [ ] o.a.s.c.Overseer Overseer could not process the current clusterstate state update message, skipping the message: {
>
> ERROR (main-EventThread) [ ] o.a.z.ClientCnxn Error while calling watcher =>
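The last error in the new batch is the distributor failing on a delete-by-query against an expiry field (`_expire_at_`). As a point of reference, a request body of that shape can be sketched as follows; the field name matches the log, but the TTL and endpoint are illustrative assumptions, not the poster's actual setup:

```python
import json
from datetime import datetime, timedelta, timezone

def expiry_delete_body(field="_expire_at_", ttl_minutes=10):
    """Build a Solr delete-by-query JSON body that removes documents
    whose expiry timestamp is older than `ttl_minutes` ago."""
    cutoff = datetime.now(timezone.utc) - timedelta(minutes=ttl_minutes)
    # Range query matching the one in the log: everything up to the cutoff,
    # with caching disabled via the {!cache=false} local parameter.
    stamp = cutoff.strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3]  # millisecond precision
    query = "{!cache=false}%s:[* TO %sZ]" % (field, stamp)
    return json.dumps({"delete": {"query": query}})

# This body would be POSTed to http://<host>:8983/solr/<collection>/update
body = expiry_delete_body()
```

Such periodic delete-by-query traffic runs in the same distributed update path as the atomic updates, so it competes for the same queues that the "stalled" errors above complain about.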
Re: Replicas in Recovery During Atomic Updates
Just to give you an idea, this is how we are ingesting:

{"id": 1, "field1": {"inc": 20}, "field2": {"inc": 30}, "field3": 40,
"field4": "some string"}

We are using Solr-8.5.1. We have not configured any update processor. Hard commit happens every minute or at 100k docs, soft commit happens every 10 mins.
We have an external ZK setup with 5 nodes.

Open files hard/soft limit is 65k and "max user processes" is unlimited.

These are the different ERROR logs I found in the log files:

ERROR (qtp1546693040-2637) [c:collection s:shard27 r:core_node109 x:collection_shard27_replica_n106] o.a.s.s.HttpSolrCall null:org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException: Async exception during distributed update: java.net.ConnectException: Connection refused

ERROR (qtp1546693040-1136) [c:collection s:shard101 r:core_node405 x:collection_shard101_replica_n402] o.a.s.s.HttpSolrCall null:java.io.IOException: java.lang.InterruptedException

ERROR (qtp1546693040-2704) [c:collection s:shard101 r:core_node405 x:collection_shard101_replica_n402] o.a.s.s.HttpSolrCall null:org.eclipse.jetty.io.EofException: Reset cancel_stream_error

ERROR (qtp1546693040-1344) [c:collection s:shard20 r:core_node79 x:collection_shard20_replica_n76] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: No registered leader was found after waiting for 4000ms, collection: collection slice: shard48 saw state=DocCollection(collection//collections/collection/state.json/96434)={

ERROR (qtp1546693040-2928) [c:collection s:shard80 r:core_node319 x:collection_shard80_replica_n316] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Request says it is coming from leader, but we are the leader

ERROR (updateExecutor-5-thread-47-processing-n:192.100.20.19:8985_solr x:collection_shard161_replica_n641 c:collection s:shard161 r:core_node646) [c:collection s:shard161 r:core_node646 x:collection_shard161_replica_n641] o.a.s.u.SolrCmdDistributor org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException: Error from server at null: Expected mime type application/octet-stream but got application/json

ERROR (recoveryExecutor-7-thread-16-processing-n:192.100.20.33:8984_solr x:collection_shard80_replica_n47 c:collection s:shard80 r:core_node48) [c:collection s:shard80 r:core_node48 x:collection_shard80_replica_n47] o.a.s.c.RecoveryStrategy Error while trying to recover. core=collection_shard80_replica_n47:java.util.concurrent.ExecutionException: org.apache.solr.client.solrj.SolrServerException: IOException occurred when talking to server at: http://192.100.20.34:8984/solr

ERROR (zkCallback-10-thread-22) [c:collection s:shard19 r:core_node322 x:collection_shard19_replica_n321] o.a.s.c.ShardLeaderElectionContext There was a problem trying to register as the leader:org.apache.solr.common.AlreadyClosedException

ERROR (OverseerStateUpdate-176461820351853980-192.100.20.34:8985_solr-n_002357) [ ] o.a.s.c.Overseer Overseer could not process the current clusterstate state update message, skipping the message: {

ERROR (main-EventThread) [ ] o.a.z.ClientCnxn Error while calling watcher => java.lang.OutOfMemoryError: unable to create new native thread

ERROR (coreContainerWorkExecutor-2-thread-1-processing-n:192.100.20.34:8986_solr) [ ] o.a.s.c.CoreContainer Error waiting for SolrCore to be loaded on startup => org.apache.solr.cloud.ZkController$NotInClusterStateException: coreNodeName core_node638 does not exist in shard shard105, ignore the exception if the replica was deleted

ERROR (qtp836220863-249) [c:collection s:shard162 r:core_node548 x:collection_shard162_replica_n547] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: No registered leader was found after waiting for 4000ms, collection: collection slice: shard162 saw state=DocCollection(collection//collections/collection/state.json/43121)={

Regards,
Anshuman

On Mon, Aug 10, 2020 at 9:19 PM Jörn Franke wrote:

> How do you ingest it exactly with atomic updates? Is there an update
> processor in between?
>
> What are your settings for hard/soft commit?
>
> For the shard going into recovery - do you have a log entry or something?
>
> What is the Solr version?
>
> How do you set up ZK?
>
> > On 10.08.2020 at 16:24, Anshuman Singh wrote:
> >
> > Hi,
> >
> > We have a SolrCloud cluster with 10 nodes. We have 6B records ingested in
> > the Collection. Our use case requires atomic updates ("inc") on 5 fields.
> > Now almost 90% documents are atomic updates and as soon as we start our
> > ingestion pipelines, multiple shards start going into recovery, sometimes
> > all replicas of some shards go into down state.
> > The ingestion rate is also too slow with atomic updates, 4-5k per second.
> > We were able to ingest records without atomic updates at the rate of 50k
> > records per second without any issues.
> >
> > What I'm suspecting is, the fact that these "inc" atomic updates
> > require fetching of fields
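The commit settings described in this message ("hard commit every minute or at 100k docs, soft commit every 10 mins") would correspond to roughly the following solrconfig.xml fragment. This is a sketch reconstructed from the prose above, not the poster's actual configuration; in particular, `openSearcher` is an assumption:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>60000</maxTime>       <!-- hard commit every minute -->
    <maxDocs>100000</maxDocs>      <!-- ...or after 100k docs -->
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>600000</maxTime>      <!-- soft commit (new searcher) every 10 minutes -->
  </autoSoftCommit>
</updateHandler>
```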
Re: Replicas in Recovery During Atomic Updates
How do you ingest it exactly with atomic updates? Is there an update processor in between?

What are your settings for hard/soft commit?

For the shard going into recovery - do you have a log entry or something?

What is the Solr version?

How do you set up ZK?

> On 10.08.2020 at 16:24, Anshuman Singh wrote:
>
> Hi,
>
> We have a SolrCloud cluster with 10 nodes. We have 6B records ingested in
> the Collection. Our use case requires atomic updates ("inc") on 5 fields.
> Now almost 90% of documents are atomic updates, and as soon as we start our
> ingestion pipelines, multiple shards start going into recovery; sometimes
> all replicas of some shards go into the down state.
> The ingestion rate is also too slow with atomic updates, 4-5k per second.
> We were able to ingest records without atomic updates at the rate of 50k
> records per second without any issues.
>
> What I'm suspecting is that the fact that these "inc" atomic updates
> require fetching of fields before indexing can cause slow rates, but what
> I'm not getting is: why are the replicas going into recovery?
>
> Regards,
> Anshuman
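The ingestion format under discussion ("inc" modifiers mixed with plain field values, as in the example earlier in the thread) can be sketched as a small builder for the JSON body that gets POSTed to /update. The field names mirror the thread's example; the helper itself is illustrative, not part of any Solr client API:

```python
import json

def atomic_inc_update(doc_id, increments, extra_fields=None):
    """Build one Solr atomic-update document: each entry in `increments`
    becomes an {"inc": n} modifier; any extra fields are sent as plain values,
    matching the format shown in this thread."""
    doc = {"id": doc_id}
    for field, amount in increments.items():
        doc[field] = {"inc": amount}
    if extra_fields:
        doc.update(extra_fields)
    # Solr's /update endpoint accepts a JSON array of documents
    return json.dumps([doc])

body = atomic_inc_update(1, {"field1": 20, "field2": 30},
                         {"field3": 40, "field4": "some string"})
```

Note that every such document forces Solr to look up the existing document by `id` before it can apply the increments, which is the per-document cost the thread is debating.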
Re: Replicas in Recovery During Atomic Updates
Good question, what do the logs say? You've provided very little information to help diagnose the issue.

As to your observation that atomic updates are expensive, that's true. Under the covers, Solr has to go out and fetch the document, overlay your changes, and then re-index the full document. So, indeed, it actually takes more work as far as Solr is concerned than just having the entire document re-sent by the client. I don't know offhand what the root cause of the difference in ingestion rates is...

Best,
Erick

> On Aug 10, 2020, at 10:23 AM, Anshuman Singh wrote:
>
> Hi,
>
> We have a SolrCloud cluster with 10 nodes. We have 6B records ingested in
> the Collection. Our use case requires atomic updates ("inc") on 5 fields.
> Now almost 90% documents are atomic updates and as soon as we start our
> ingestion pipelines, multiple shards start going into recovery, sometimes
> all replicas of some shards go into down state.
> The ingestion rate is also too slow with atomic updates, 4-5k per second.
> We were able to ingest records without atomic updates at the rate of 50k
> records per second without any issues.
>
> What I'm suspecting is, the fact that these "inc" atomic updates
> require fetching of fields before indexing can cause slow rates but what
> I'm not getting is, why are the replicas going into recovery?
>
> Regards,
> Anshuman
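The fetch/overlay/re-index cycle described above can be illustrated with a toy in-memory model. This is only a sketch of the concept, not Solr's actual implementation: the `index` dict stands in for the stored fields and the real work Lucene does on re-index is elided:

```python
def apply_atomic_update(index, update):
    """Toy model of the atomic-update path: fetch the existing document,
    overlay the modifiers, then re-index the full resulting document."""
    doc = dict(index.get(update["id"], {"id": update["id"]}))   # 1. fetch stored doc
    for field, value in update.items():
        if isinstance(value, dict) and "inc" in value:
            doc[field] = doc.get(field, 0) + value["inc"]       # 2. overlay "inc"
        else:
            doc[field] = value                                   #    plain value replaces
    index[update["id"]] = doc                                    # 3. re-index whole doc
    return doc

index = {1: {"id": 1, "field1": 5, "field4": "old"}}
result = apply_atomic_update(index, {"id": 1, "field1": {"inc": 20}, "field4": "new"})
```

The extra read in step 1 is why a stream that is 90% atomic updates costs more per document than re-sending full documents, where only step 3 is needed.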