Re: Replicas in Recovery During Atomic Updates

2020-08-19 Thread Anshuman Singh
Hi,

Does anyone have any idea about this issue?
Apart from the errors in the previous email, we are frequently seeing the
errors below:

2020-08-19 11:56:09.467 ERROR (qtp1546693040-32) [c:collection_4 s:shard3
r:core_node13 x:collection_4_shard3_replica_n10] o.a.s.u.SolrCmdDistributor
java.io.IOException: Request processing has stalled for 20017ms with 100
remaining elements in the queue.

2020-08-19 11:56:16.243 ERROR (qtp1546693040-72) [c:collection_4 s:shard3
r:core_node13 x:collection_4_shard3_replica_n10] o.a.s.h.RequestHandlerBase
java.io.IOException: Task queue processing has stalled for 20216 ms with 0
remaining elements to process.

2020-08-19 11:56:22.584 ERROR (qtp1546693040-32) [c:collection_4 s:shard3
r:core_node13 x:collection_4_shard3_replica_n10]
o.a.s.u.p.DistributedZkUpdateProcessor Setting up to try to start recovery
on replica core_node11 with url
http://x.x.x.25:8983/solr/collection_4_shard3_replica_n8/ by increasing
leader term => java.io.IOException: Request processing has stalled for
20017ms with 100 remaining elements in the queue.

2020-08-19 11:56:16.064 ERROR (updateExecutor-5-thread-8-processing-null) [
  ] o.a.s.u.ErrorReportingConcurrentUpdateSolrClient Error when calling
SolrCmdDistributor$Req:
cmd=delete{_version_=-1675454745405292544,query=`{!cache=false}_expire_at_:[*
TO 2020-08-19T11:55:47.604Z]`,commitWithin=-1}; node=ForwardNode:
http://x.x.x.24:8983/solr/collection_4_shard3_replica_n10/ to
http://x.x.x.24:8983/solr/collection_4_shard3_replica_n10/ =>
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
from server at http://x.x.x.24:8983/solr/collection_4_shard3_replica_n10/:
null


On Tue, Aug 11, 2020 at 2:08 AM Anshuman Singh wrote:

> Just to give you an idea, this is how we are ingesting:
>
> {"id": 1, "field1": {"inc": 20}, "field2": {"inc": 30}, "field3": 40.
> "field4": "some string"}
>
> We are using Solr 8.5.1. We have not configured any update processor. Hard
> commit happens every minute or at 100k docs; soft commit happens every 10
> minutes.
> We have an external ZK setup with 5 nodes.
>
> Open files hard/soft limit is 65k and "max user processes" is unlimited.
>
> These are the different ERROR logs I found in the log files:
>
> ERROR (qtp1546693040-2637) [c:collection s:shard27 r:core_node109
> x:collection_shard27_replica_n106] o.a.s.s.HttpSolrCall
> null:org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException:
> Async exception during distributed update: java.net.ConnectException:
> Connection refused
>
> ERROR (qtp1546693040-1136) [c:collection s:shard101 r:core_node405
> x:collection_shard101_replica_n402] o.a.s.s.HttpSolrCall
> null:java.io.IOException: java.lang.InterruptedException
>
> ERROR (qtp1546693040-2704) [c:collection s:shard101 r:core_node405
> x:collection_shard101_replica_n402] o.a.s.s.HttpSolrCall
> null:org.eclipse.jetty.io.EofException: Reset cancel_stream_error
>
> ERROR (qtp1546693040-1344) [c:collection s:shard20 r:core_node79
> x:collection_shard20_replica_n76] o.a.s.h.RequestHandlerBase
> org.apache.solr.common.SolrException: No registered leader was found after
> waiting for 4000ms , collection: collection slice: shard48 saw
> state=DocCollection(collection//collections/collection/state.json/96434)={
>
> ERROR (qtp1546693040-2928) [c:collection s:shard80 r:core_node319
> x:collection_shard80_replica_n316] o.a.s.h.RequestHandlerBase
> org.apache.solr.common.SolrException: Request says it is coming from
> leader, but we are the leader
>
> ERROR (updateExecutor-5-thread-47-processing-n:192.100.20.19:8985_solr
> x:collection_shard161_replica_n641 c:collection s:shard161 r:core_node646)
> [c:collection s:shard161 r:core_node646 x:collection_shard161_replica_n641]
> o.a.s.u.SolrCmdDistributor
> org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException:
> Error from server at null: Expected mime type application/octet-stream but
> got application/json
>
> ERROR (recoveryExecutor-7-thread-16-processing-n:192.100.20.33:8984_solr
> x:collection_shard80_replica_n47 c:collection s:shard80 r:core_node48)
> [c:collection s:shard80 r:core_node48 x:collection_shard80_replica_n47]
> o.a.s.c.RecoveryStrategy Error while trying to recover.
> core=collection_shard80_replica_n47:java.util.concurrent.ExecutionException:
> org.apache.solr.client.solrj.SolrServerException: IOException occurred when
> talking to server at: http://192.100.20.34:8984/solr
>
> ERROR (zkCallback-10-thread-22) [c:collection s:shard19 r:core_node322
> x:collection_shard19_replica_n321] o.a.s.c.ShardLeaderElectionContext There
> was a problem trying to register as the
> leader:org.apache.solr.common.AlreadyClosedException
>
> ERROR
> (OverseerStateUpdate-176461820351853980-192.100.20.34:8985_solr-n_002357)
> [   ] o.a.s.c.Overseer Overseer could not process the current clusterstate
> state update message, skipping the message: {
>
> ERROR (main-EventThread) [   ] o.a.z.ClientCnxn Error while calling
> watcher  => 

Re: Replicas in Recovery During Atomic Updates

2020-08-10 Thread Anshuman Singh
Just to give you an idea, this is how we are ingesting:

{"id": 1, "field1": {"inc": 20}, "field2": {"inc": 30}, "field3": 40.
"field4": "some string"}

We are using Solr 8.5.1. We have not configured any update processor. Hard
commit happens every minute or at 100k docs; soft commit happens every 10
minutes.
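
That cadence would correspond roughly to the following solrconfig.xml
fragment (reconstructed from the description above; openSearcher=false is
the usual setting for hard commits and is assumed here):

<updateHandler class="solr.DirectUpdateHandler2">
  <!-- hard commit: every 60s or at 100k docs, whichever comes first -->
  <autoCommit>
    <maxTime>60000</maxTime>
    <maxDocs>100000</maxDocs>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- soft commit: every 10 minutes, makes new documents searchable -->
  <autoSoftCommit>
    <maxTime>600000</maxTime>
  </autoSoftCommit>
</updateHandler>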
We have an external ZK setup with 5 nodes.

Open files hard/soft limit is 65k and "max user processes" is unlimited.

These are the different ERROR logs I found in the log files:

ERROR (qtp1546693040-2637) [c:collection s:shard27 r:core_node109
x:collection_shard27_replica_n106] o.a.s.s.HttpSolrCall
null:org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException:
Async exception during distributed update: java.net.ConnectException:
Connection refused

ERROR (qtp1546693040-1136) [c:collection s:shard101 r:core_node405
x:collection_shard101_replica_n402] o.a.s.s.HttpSolrCall
null:java.io.IOException: java.lang.InterruptedException

ERROR (qtp1546693040-2704) [c:collection s:shard101 r:core_node405
x:collection_shard101_replica_n402] o.a.s.s.HttpSolrCall
null:org.eclipse.jetty.io.EofException: Reset cancel_stream_error

ERROR (qtp1546693040-1344) [c:collection s:shard20 r:core_node79
x:collection_shard20_replica_n76] o.a.s.h.RequestHandlerBase
org.apache.solr.common.SolrException: No registered leader was found after
waiting for 4000ms , collection: collection slice: shard48 saw
state=DocCollection(collection//collections/collection/state.json/96434)={

ERROR (qtp1546693040-2928) [c:collection s:shard80 r:core_node319
x:collection_shard80_replica_n316] o.a.s.h.RequestHandlerBase
org.apache.solr.common.SolrException: Request says it is coming from
leader, but we are the leader

ERROR (updateExecutor-5-thread-47-processing-n:192.100.20.19:8985_solr
x:collection_shard161_replica_n641 c:collection s:shard161 r:core_node646)
[c:collection s:shard161 r:core_node646 x:collection_shard161_replica_n641]
o.a.s.u.SolrCmdDistributor
org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException:
Error from server at null: Expected mime type application/octet-stream but
got application/json

ERROR (recoveryExecutor-7-thread-16-processing-n:192.100.20.33:8984_solr
x:collection_shard80_replica_n47 c:collection s:shard80 r:core_node48)
[c:collection s:shard80 r:core_node48 x:collection_shard80_replica_n47]
o.a.s.c.RecoveryStrategy Error while trying to recover.
core=collection_shard80_replica_n47:java.util.concurrent.ExecutionException:
org.apache.solr.client.solrj.SolrServerException: IOException occurred when
talking to server at: http://192.100.20.34:8984/solr

ERROR (zkCallback-10-thread-22) [c:collection s:shard19 r:core_node322
x:collection_shard19_replica_n321] o.a.s.c.ShardLeaderElectionContext There
was a problem trying to register as the
leader:org.apache.solr.common.AlreadyClosedException

ERROR
(OverseerStateUpdate-176461820351853980-192.100.20.34:8985_solr-n_002357)
[   ] o.a.s.c.Overseer Overseer could not process the current clusterstate
state update message, skipping the message: {

ERROR (main-EventThread) [   ] o.a.z.ClientCnxn Error while calling watcher
 => java.lang.OutOfMemoryError: unable to create new native thread

ERROR 
(coreContainerWorkExecutor-2-thread-1-processing-n:192.100.20.34:8986_solr)
[   ] o.a.s.c.CoreContainer Error waiting for SolrCore to be loaded on
startup => org.apache.solr.cloud.ZkController$NotInClusterStateException:
coreNodeName core_node638 does not exist in shard shard105, ignore the
exception if the replica was deleted

ERROR (qtp836220863-249) [c:collection s:shard162 r:core_node548
x:collection_shard162_replica_n547] o.a.s.h.RequestHandlerBase
org.apache.solr.common.SolrException: No registered leader was found after
waiting for 4000ms , collection: collection slice: shard162 saw
state=DocCollection(collection//collections/collection/state.json/43121)={

Regards,
Anshuman

On Mon, Aug 10, 2020 at 9:19 PM Jörn Franke wrote:

> How do you ingest it exactly with atomic updates? Is there an update
> processor in between?
>
> What are your settings for hard/soft commit?
>
> For the shards going into recovery - do you have a log entry or something?
>
> What is the Solr version?
>
> How do you set up ZK?
>
> > On 10.08.2020 at 16:24, Anshuman Singh wrote:
> >
> > Hi,
> >
> > We have a SolrCloud cluster with 10 nodes and 6B records ingested into
> > the collection. Our use case requires atomic updates ("inc") on 5 fields.
> > Now almost 90% of the documents are atomic updates, and as soon as we
> > start our ingestion pipelines, multiple shards start going into recovery;
> > sometimes all replicas of some shards go into the down state.
> > The ingestion rate with atomic updates is also very slow, 4-5k records
> > per second, whereas we were able to ingest records without atomic updates
> > at 50k records per second without any issues.
> >
> > What I suspect is that these "inc" atomic updates
> > require fetching of fields 

Re: Replicas in Recovery During Atomic Updates

2020-08-10 Thread Jörn Franke
How do you ingest it exactly with atomic updates? Is there an update
processor in between?

What are your settings for hard/soft commit?

For the shards going into recovery - do you have a log entry or something?

What is the Solr version?

How do you set up ZK?

> On 10.08.2020 at 16:24, Anshuman Singh wrote:
> 
> Hi,
> 
> We have a SolrCloud cluster with 10 nodes and 6B records ingested into
> the collection. Our use case requires atomic updates ("inc") on 5 fields.
> Now almost 90% of the documents are atomic updates, and as soon as we
> start our ingestion pipelines, multiple shards start going into recovery;
> sometimes all replicas of some shards go into the down state.
> The ingestion rate with atomic updates is also very slow, 4-5k records
> per second, whereas we were able to ingest records without atomic updates
> at 50k records per second without any issues.
> 
> What I suspect is that these "inc" atomic updates require fetching of
> fields before indexing, which could explain the slow rates. But what I
> don't get is: why are the replicas going into recovery?
> 
> Regards,
> Anshuman


Re: Replicas in Recovery During Atomic Updates

2020-08-10 Thread Erick Erickson
Good question, what do the logs say? You’ve provided very little information
to help diagnose the issue.

As to your observation that atomic updates are expensive, that’s true. Under
the covers, Solr has to go out and fetch the document, overlay your changes
and then re-index the full document. So, indeed, it actually takes more work
as far as Solr is concerned than just having the entire document re-sent by
the client.
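
To make that concrete, the client-side equivalent of a single {"inc": 20}
is roughly the sketch below (illustrative only; internally Solr serves the
fetch via realtime get from the update log/index rather than a search, and
does all of this server-side):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class ManualIncSketch {
    // The work behind one atomic {"inc": 20} on field1:
    // fetch the whole document, overlay the change, re-index the whole thing.
    static void manualInc(SolrClient client, String collection, String id)
            throws Exception {
        // 1. Fetch the current document
        SolrDocument current = client.getById(collection, id);

        // 2. Overlay the increment on the stored value
        Number old = current == null ? null
                : (Number) current.getFirstValue("field1");
        long incremented = (old == null ? 0L : old.longValue()) + 20;

        // 3. Re-index the full document
        SolrInputDocument updated = new SolrInputDocument();
        updated.addField("id", id);
        updated.addField("field1", incremented);
        // ...every other stored field must be carried over as well, which is
        // exactly the part Solr does for you server-side.
        client.add(collection, updated);
    }
}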

I don’t know offhand what the root cause of the difference in ingestion rates
is…

Best,
Erick

> On Aug 10, 2020, at 10:23 AM, Anshuman Singh wrote:
> 
> Hi,
> 
> We have a SolrCloud cluster with 10 nodes and 6B records ingested into
> the collection. Our use case requires atomic updates ("inc") on 5 fields.
> Now almost 90% of the documents are atomic updates, and as soon as we
> start our ingestion pipelines, multiple shards start going into recovery;
> sometimes all replicas of some shards go into the down state.
> The ingestion rate with atomic updates is also very slow, 4-5k records
> per second, whereas we were able to ingest records without atomic updates
> at 50k records per second without any issues.
> 
> What I suspect is that these "inc" atomic updates require fetching of
> fields before indexing, which could explain the slow rates. But what I
> don't get is: why are the replicas going into recovery?
> 
> Regards,
> Anshuman