Hi,
I restarted both of my Solr servers but I am seeing the async error again. In
older 5.x versions of SolrCloud, Solr would normally recover gracefully from
network errors, but Solr 6.6.3 does not seem to be doing that. At this time
only a small percentage of my operations are deleteByQuery; it is mostly
indexing of documents. I have not noticed any network blip like last time.
Any suggestions, or is anyone else having the same issue on Solr 6.6.3?

I am again seeing the following two errors back to back.

 ERROR org.apache.solr.update.StreamingSolrClients

org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException:
Async exception during distributed update: Read timed out
Thanks
Jay



On Wed, May 9, 2018 at 12:34 AM Emir Arnautović <
emir.arnauto...@sematext.com> wrote:

> Hi Jay,
> A network blip might be the cause, but it could also be a consequence of
> this issue. Maybe you can try avoiding DBQ while indexing and see if it is
> the cause. You can also take a thread dump on “the other” node and see if
> there are blocked threads; that can give you more clues about what’s going
> on.
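>
> If it helps, a thread dump can be taken with the JDK’s jstack tool. A
> minimal example (assuming jstack is on the PATH; <solr-pid> is a
> placeholder for the Solr process id on that node):
>
>   jstack -l <solr-pid> > /tmp/solr-threads.txt
>
> Threads stuck on a lock will show up with state BLOCKED in the dump.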
>
> Thanks,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 8 May 2018, at 17:53, Jay Potharaju <jspothar...@gmail.com> wrote:
> >
> > Hi Emir,
> > I was seeing this error as long as the indexing was running. Once I
> > stopped the indexing the errors also stopped.  Yes, we do monitor both
> > hosts & Solr but have not seen anything out of the ordinary except for a
> > small network blip. In my experience Solr generally recovers after a
> > network blip and there are a few errors for streaming solr client... but
> > have never seen this error before.
> >
> > Thanks
> > Jay Potharaju
> >
> >
> > On Tue, May 8, 2018 at 12:56 AM, Emir Arnautović <
> > emir.arnauto...@sematext.com> wrote:
> >
> >> Hi Jay,
> >> This is a low ingestion rate. What is the size of your index? What is
> >> the heap size? I am guessing that this is not a huge index, so I am
> >> leaning toward what Shawn mentioned - some combination of
> >> DBQ/merge/commit/optimise that is blocking indexing. Though, it is
> >> strange that it is happening only on one node if you are sending updates
> >> randomly to both nodes. Do you monitor your hosts/Solr? Do you see
> >> anything different at the time when timeouts happen?
> >>
> >> Thanks,
> >> Emir
> >> --
> >> Monitoring - Log Management - Alerting - Anomaly Detection
> >> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> >>
> >>
> >>
> >>> On 8 May 2018, at 03:23, Jay Potharaju <jspothar...@gmail.com> wrote:
> >>>
> >>> I have about 3-5 updates per second.
> >>>
> >>>
> >>>> On May 7, 2018, at 5:02 PM, Shawn Heisey <apa...@elyograg.org> wrote:
> >>>>
> >>>>> On 5/7/2018 5:05 PM, Jay Potharaju wrote:
> >>>>> There are some deletes by query. I have not had any issues with DBQ,
> >>>>> currently have 5.3 running in production.
> >>>>
> >>>> Here's the big problem with DBQ.  Imagine this sequence of events with
> >>>> these timestamps:
> >>>>
> >>>> 13:00:00: A commit for change visibility happens.
> >>>> 13:00:00: A segment merge is triggered by the commit.
> >>>> (It's a big merge that takes exactly 3 minutes.)
> >>>> 13:00:05: A deleteByQuery is sent.
> >>>> 13:00:15: An update to the index is sent.
> >>>> 13:00:25: An update to the index is sent.
> >>>> 13:00:35: An update to the index is sent.
> >>>> 13:00:45: An update to the index is sent.
> >>>> 13:00:55: An update to the index is sent.
> >>>> 13:01:05: An update to the index is sent.
> >>>> 13:01:15: An update to the index is sent.
> >>>> 13:01:25: An update to the index is sent.
> >>>> {time passes, more updates might be sent}
> >>>> 13:03:00: The merge finishes.
> >>>>
> >>>> Here's what would happen in this scenario:  The DBQ and all of the
> >>>> update requests sent *after* the DBQ will block until the merge
> >>>> finishes.  That means that it's going to take up to three minutes for
> >>>> Solr to respond to those requests.  If the client that is sending the
> >>>> request is configured with a 60 second socket timeout, which inter-node
> >>>> requests made by Solr are by default, then it is going to experience a
> >>>> timeout error.  The request will probably complete successfully once
> >>>> the merge finishes, but the connection is gone, and the client has
> >>>> already received an error.
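> >>>>
> >>>> If needed, that inter-node socket timeout can be raised in the
> >>>> <solrcloud> section of solr.xml. A sketch of just those two settings
> >>>> (values in milliseconds, shown only as examples; the rest of the
> >>>> section is omitted):
> >>>>
> >>>>   <solrcloud>
> >>>>     <!-- read (socket) timeout for inter-node update requests -->
> >>>>     <int name="distribUpdateSoTimeout">600000</int>
> >>>>     <!-- connection timeout for inter-node update requests -->
> >>>>     <int name="distribUpdateConnTimeout">60000</int>
> >>>>   </solrcloud>
> >>>>
> >>>> That only hides the pause, though; the blocking itself is the real
> >>>> problem.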
> >>>>
> >>>> Now imagine what happens if an optimize (forced merge of the entire
> >>>> index) is requested on an index that's 50GB.  That optimize may take
> >>>> 2-3 hours, possibly longer.  A deleteByQuery started on that index
> >>>> after the optimize begins (and any updates requested after the DBQ)
> >>>> will pause until the optimize is done.  A pause of 2 hours or more is
> >>>> a BIG problem.
> >>>>
> >>>> This is why deleteByQuery is not recommended.
> >>>>
> >>>> If the deleteByQuery were changed into a two-step process involving a
> >>>> query to retrieve ID values and then one or more deleteById requests,
> >>>> then none of that blocking would occur.  The deleteById operation can
> >>>> run at the same time as a segment merge, so neither it nor subsequent
> >>>> update requests will have the significant pause.  From what I
> >>>> understand, you can even do commits in this scenario and have changes
> >>>> be visible before the merge completes.  I haven't verified that this
> >>>> is the case.
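> >>>>
> >>>> A minimal SolrJ sketch of that two-step process (the zkHost, collection
> >>>> name, query, and "id" field are placeholders; paging through large
> >>>> result sets, error handling, and commit handling are omitted):
> >>>>
> >>>>   import java.util.ArrayList;
> >>>>   import java.util.List;
> >>>>   import org.apache.solr.client.solrj.SolrQuery;
> >>>>   import org.apache.solr.client.solrj.impl.CloudSolrClient;
> >>>>   import org.apache.solr.common.SolrDocument;
> >>>>
> >>>>   public class TwoStepDelete {
> >>>>     public static void main(String[] args) throws Exception {
> >>>>       try (CloudSolrClient client = new CloudSolrClient.Builder()
> >>>>           .withZkHost("zk1:2181,zk2:2181,zk3:2181").build()) {
> >>>>         client.setDefaultCollection("mycollection");
> >>>>
> >>>>         // Step 1: fetch the uniqueKey values the DBQ would match.
> >>>>         SolrQuery q = new SolrQuery("type:stale");
> >>>>         q.setFields("id");
> >>>>         q.setRows(1000);
> >>>>         List<String> ids = new ArrayList<>();
> >>>>         for (SolrDocument doc : client.query(q).getResults()) {
> >>>>           ids.add((String) doc.getFieldValue("id"));
> >>>>         }
> >>>>
> >>>>         // Step 2: deleteById, which does not block behind a merge.
> >>>>         if (!ids.isEmpty()) {
> >>>>           client.deleteById(ids);
> >>>>         }
> >>>>       }
> >>>>     }
> >>>>   }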
> >>>>
> >>>> Experienced devs: Can we fix this problem with DBQ?  On indexes with a
> >>>> uniqueKey, can DBQ be changed to use the two-step process I mentioned?
> >>>>
> >>>> Thanks,
> >>>> Shawn
> >>>>
> >>
> >>
>
>
