Emir, 'delete_by_query' was the cause of the replicas going into recovery state. I replaced it with delete-by-id as you suggested, and everything works fine after that. The cluster has held for nearly 3 hours without any failures. Thanks, Emir.
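For anyone hitting the same problem, the workaround Emir describes — query for the IDs that match the delete criterion, then issue bulk deletes by ID — can be sketched roughly as below. This is a hypothetical sketch, not the poster's actual code: the Solr URL, collection name, and the `updated_at` field are assumptions; the requests use Solr's standard JSON update syntax (`{"delete": [ids...]}`).

```python
# Sketch: replace a blocking delete-by-query with "query for IDs, then
# bulk delete by ID". URL, collection, and field names are assumptions.
import json
import urllib.parse
import urllib.request

SOLR = "http://localhost:8983/solr/mycollection"  # assumed URL and collection


def chunked(ids, size=500):
    """Split the ID list into batches so each delete request stays small."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]


def delete_by_id_payload(ids):
    """Build the JSON update body that deletes the given IDs."""
    return json.dumps({"delete": list(ids)})


def fetch_stale_ids():
    """Fetch only the IDs (fl=id) of documents older than one day."""
    params = urllib.parse.urlencode({
        "q": "updated_at:[* TO NOW-1DAY]",  # assumed timestamp field
        "fl": "id",
        "rows": 10000,
        "wt": "json",
    })
    with urllib.request.urlopen(f"{SOLR}/select?{params}") as resp:
        docs = json.load(resp)["response"]["docs"]
    return [d["id"] for d in docs]


def delete_stale_docs():
    """Replace the single delete-by-query with batched deletes by ID."""
    for batch in chunked(fetch_stale_ids()):
        req = urllib.request.Request(
            f"{SOLR}/update",
            data=delete_by_id_payload(batch).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req).close()
```

Batching the deletes (here 500 IDs per request) keeps each update small; the per-ID deletes go through the normal update path and do not block replica indexing the way DBQ does.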
On Wed, Jan 3, 2018 at 8:41 PM, Emir Arnautović <emir.arnauto...@sematext.com> wrote:

> Hi Sravan,
> DBQ does not play well with indexing - it causes indexing to be completely blocked on replicas while it is running. It is highly likely that it is the root cause of your issues. If you can change the indexing logic to avoid it, you can quickly test it. As a workaround, you can query for the IDs that need to be deleted and execute a bulk delete by ID - that will not cause the issues that DBQ does.
>
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
> > On 3 Jan 2018, at 16:04, Sravan Kumar <sra...@caavo.com> wrote:
> >
> > Emir,
> > Yes, there is a delete_by_query on every bulk insert.
> > This delete_by_query deletes all documents that were last updated more than a day before the current time.
> > Is the bulk delete_by_query the reason?
> >
> > On Wed, Jan 3, 2018 at 7:58 PM, Emir Arnautović <emir.arnauto...@sematext.com> wrote:
> >
> >> Do you have deletes by query while indexing, or is it an append-only index?
> >>
> >> Regards,
> >> Emir
> >> --
> >> Monitoring - Log Management - Alerting - Anomaly Detection
> >> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> >>
> >>> On 3 Jan 2018, at 12:16, sravan <sra...@caavo.com> wrote:
> >>>
> >>> SolrCloud nodes going to recovery state during indexing
> >>>
> >>> We have a SolrCloud setup with the settings shared below. We have a collection with 3 shards and a replica for each of them.
> >>>
> >>> Normal state (as soon as the whole cluster is restarted):
> >>> - Status of all the shards is UP.
> >>> - A bulk update request of 50 documents takes < 100 ms.
> >>> - 6-10 simultaneous bulk updates.
> >>>
> >>> Nodes go into recovery state after 15-30 minutes of updates:
> >>> - Some shards start giving the following ERRORs:
> >>>   - o.a.s.h.RequestHandlerBase org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException: Async exception during distributed update: Read timed out
> >>>   - o.a.s.u.StreamingSolrClients error java.net.SocketTimeoutException: Read timed out
> >>> - The following error is seen on the shard that goes into recovery state:
> >>>   - too many updates received since start - startingUpdates no longer overlaps with our currentUpdates
> >>> - Sometimes the same shard even goes to DOWN state and needs a node restart to come back.
> >>> - A bulk update request of 50 documents takes more than 5 seconds, sometimes even > 120 secs. This is seen for all requests if at least one node in the whole cluster is in recovery state.
> >>>
> >>> We have a standalone setup with the same collection schema which is able to take the update & query load without any errors.
> >>>
> >>> We have the following SolrCloud setup:
> >>> - Set up in AWS.
> >>>
> >>> - Zookeeper setup:
> >>>   - number of nodes: 3
> >>>   - AWS instance type: t2.small
> >>>   - instance memory: 2 GB
> >>>
> >>> - Solr setup:
> >>>   - Solr version: 6.6.0
> >>>   - number of nodes: 3
> >>>   - AWS instance type: m5.xlarge
> >>>   - instance memory: 16 GB
> >>>   - number of cores: 4
> >>>   - Java heap: 8 GB
> >>>   - Java version: Oracle Java "1.8.0_151"
> >>>   - GC settings: default CMS.
> >>>
> >>> Collection settings:
> >>> - number of shards: 3
> >>> - replication factor: 2
> >>> - total 6 replicas
> >>> - total number of documents in the collection: 12 million
> >>> - total number of documents in each shard: 4 million
> >>> - Each document has around 25 fields, 12 of them with textual analyzers & filters.
> >>> - Commit strategy:
> >>>   - No explicit commits from application code.
> >>>   - Hard commit every 15 secs with openSearcher set to false.
> >>>   - Soft commit every 10 mins.
> >>> - Cache strategy:
> >>>   - filter queries:
> >>>     - size: 512
> >>>     - autowarmCount: 100
> >>>   - all other caches:
> >>>     - size: 512
> >>>     - autowarmCount: 0
> >>> - maxWarmingSearchers: 2
> >>>
> >>> We tried the following:
> >>> - commit strategy:
> >>>   - hard commit: 150 secs
> >>>   - soft commit: 5 mins
> >>> - with the G1 garbage collector, based on https://wiki.apache.org/solr/ShawnHeisey#Java_8_recommendation_for_Solr:
> >>>   - the nodes go into recovery state in less than a minute.
> >>>
> >>> The issue is seen even when the leaders are balanced across the three nodes.
> >>>
> >>> Can you help us find the solution to this problem?
> >>
> >
> > --
> > Regards,
> > Sravan

--
Regards,
Sravan
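For reference, the commit strategy described in the thread (hard commit every 15 secs with openSearcher false, soft commit every 10 mins) corresponds to a solrconfig.xml fragment along these lines — a sketch of the standard autoCommit/autoSoftCommit configuration, not the poster's actual file:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard commit: flush to disk every 15 s without opening a new searcher -->
  <autoCommit>
    <maxTime>15000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- Soft commit: make new documents visible to searches every 10 min -->
  <autoSoftCommit>
    <maxTime>600000</maxTime>
  </autoSoftCommit>
</updateHandler>
```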