Emir,
   'delete_by_query' was the cause of the replicas going into recovery state.
   I replaced it with delete_by_id as you suggested, and everything works fine
after that. The cluster has held for nearly 3 hours without any failures.
  Thanks, Emir.
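For anyone hitting this later, the workaround can be sketched roughly as below. This is not our actual code: the function name, the batch size, and the `updated_at` field are assumptions for illustration only.

```python
import json

# Sketch of the delete-by-id workaround: instead of issuing one
# delete_by_query, first query for the matching IDs (e.g. with
# q=updated_at:[* TO NOW-1DAY]&fl=id -- field name assumed), then
# send the IDs back in batched delete-by-id requests.

def build_delete_by_id_payloads(ids, batch_size=500):
    """Split IDs into JSON bodies for POSTing to Solr's /update handler.

    Solr's JSON update syntax accepts {"delete": [id, id, ...]}, which
    deletes by unique key and does not block indexing on replicas the
    way delete-by-query does.
    """
    payloads = []
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        payloads.append(json.dumps({"delete": batch}))
    return payloads
```

Each payload would then be POSTed to the collection's /update endpoint with Content-Type application/json; commits stay governed by the autoCommit/autoSoftCommit settings.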


On Wed, Jan 3, 2018 at 8:41 PM, Emir Arnautović <
emir.arnauto...@sematext.com> wrote:

> Hi Sravan,
> DBQ does not play well with indexing - it causes indexing to be completely
> blocked on replicas while it is running. It is highly likely the root cause
> of your issues. If you can change your indexing logic to avoid it, you can
> quickly test this. As a workaround, you can query for the IDs that need to
> be deleted and execute a bulk delete by ID - that will not cause the same
> issues as DBQ.
>
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 3 Jan 2018, at 16:04, Sravan Kumar <sra...@caavo.com> wrote:
> >
> > Emir,
> >    Yes, there is a delete_by_query on every bulk insert.
> >    This delete_by_query deletes all documents that were last updated more
> > than a day before the current time.
> >    Is the bulk delete_by_query the reason?
> >
> > On Wed, Jan 3, 2018 at 7:58 PM, Emir Arnautović <
> > emir.arnauto...@sematext.com> wrote:
> >
> >> Do you have deletes by query while indexing, or is it an append-only index?
> >>
> >> Regards,
> >> Emir
> >> --
> >> Monitoring - Log Management - Alerting - Anomaly Detection
> >> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> >>
> >>
> >>
> >>> On 3 Jan 2018, at 12:16, sravan <sra...@caavo.com> wrote:
> >>>
> >>> SolrCloud Nodes going to recovery state during indexing
> >>>
> >>>
> >>> We have solr cloud setup with the settings shared below. We have a
> >> collection with 3 shards and a replica for each of them.
> >>>
> >>> Normal state (as soon as the whole cluster is restarted):
> >>>    - Status of all the shards is UP.
> >>>    - each bulk update request of 50 documents takes < 100 ms.
> >>>    - 6-10 simultaneous bulk updates.
> >>>
> >>> Nodes go into recovery state after 15-30 mins of updates:
> >>>    - Some shards start giving the following ERRORs:
> >>>        - o.a.s.h.RequestHandlerBase org.apache.solr.update.processor.
> >> DistributedUpdateProcessor$DistributedUpdatesAsyncException: Async
> >> exception during distributed update: Read timed out
> >>>        - o.a.s.u.StreamingSolrClients error java.net.
> SocketTimeoutException:
> >> Read timed out
> >>>    - the following error is seen on the shard which goes into recovery
> >> state:
> >>>        - too many updates received since start - startingUpdates no
> >> longer overlaps with our currentUpdates.
> >>>    - Sometimes, the same shard even goes to DOWN state and needs a node
> >> restart to come back.
> >>>    - a bulk update request of 50 documents takes more than 5 seconds,
> >> sometimes even >120 secs. This is seen for all requests whenever at
> >> least one node in the whole cluster is in recovery state.
> >>>
> >>> We have a standalone setup with the same collection schema which is
> able
> >> to take update & query load without any errors.
> >>>
> >>>
> >>> We have the following solrcloud setup.
> >>>    - setup in AWS.
> >>>
> >>>    - Zookeeper Setup:
> >>>        - number of nodes: 3
> >>>        - aws instance type: t2.small
> >>>        - instance memory: 2gb
> >>>
> >>>    - Solr Setup:
> >>>        - Solr version: 6.6.0
> >>>        - number of nodes: 3
> >>>        - aws instance type: m5.xlarge
> >>>        - instance memory: 16gb
> >>>        - number of cores: 4
> >>>        - JAVA HEAP: 8gb
> >>>        - JAVA VERSION: oracle java version "1.8.0_151"
> >>>        - GC settings: default CMS.
> >>>
> >>>        collection settings:
> >>>            - number of shards: 3
> >>>            - replication factor: 2
> >>>            - total 6 replicas.
> >>>            - total number of documents in the collection: 12 million
> >>>            - total number of documents in each shard: 4 million
> >>>            - Each document has around 25 fields, 12 of them with
> >> text analyzers & filters.
> >>>            - Commit Strategy:
> >>>                - No explicit commits from application code.
> >>>                - Hard commit every 15 secs with openSearcher=false.
> >>>                - Soft commit of 10 mins.
> >>>            - Cache Strategy:
> >>>                - filter queries
> >>>                    - number: 512
> >>>                    - autowarmCount: 100
> >>>                - all other caches
> >>>                    - number: 512
> >>>                    - autowarmCount: 0
> >>>            - maxWarmingSearchers: 2
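For reference, the commit settings described above would correspond to a solrconfig.xml fragment along these lines (values taken from the list; this is a sketch, not the actual config file):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>15000</maxTime>          <!-- hard commit every 15 s -->
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>600000</maxTime>         <!-- soft commit every 10 min -->
  </autoSoftCommit>
</updateHandler>
```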
> >>>
> >>>
> >>> - We tried the following
> >>>    - commit strategy
> >>>        - hard commit - 150 secs
> >>>        - soft commit - 5 mins
> >>>    - with the G1 garbage collector based on
> https://wiki.apache.org/solr/
> >> ShawnHeisey#Java_8_recommendation_for_Solr:
> >>>        - the nodes go into recovery state in less than a minute.
> >>>
> >>> The issue is seen even when the leaders are balanced across the three
> >> nodes.
> >>>
> >>> Can you help us find the solution to this problem?
> >>
> >>
> >
> >
> > --
> > Regards,
> > Sravan
>
>


-- 
Regards,
Sravan
