Adding some more context to my last email:
Solr: 6.6.3
2 nodes: 3 shards each
No replication.
Can someone answer the following questions?
1) Any ideas on why the following errors keep happening? AFAIK the 
StreamingSolrClients error is caused by timeouts when connecting to other nodes, 
and the async errors are also network related, as Emir explained earlier in this 
thread. There were no network issues this time, but the error has come back and 
is filling up my logs.
2) Is anyone using Solr 6.6.3 in production, and what has your experience been 
so far?
3) Is there any good documentation or blog post that explains the inner 
workings of SolrCloud networking?

Thanks
Jay
ERROR org.apache.solr.update.StreamingSolrClients
org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException:
Async exception during distributed update: Read timed out
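If the read timeouts behind question 1 simply reflect merges outlasting the inter-node socket timeout, one knob worth checking is the `<solrcloud>` section of solr.xml. A minimal sketch; the values here are illustrative assumptions, not recommendations:

```xml
<!-- solr.xml: raise inter-node update timeouts (values are illustrative) -->
<solrcloud>
  <!-- socket read timeout for distributed updates, in milliseconds -->
  <int name="distribUpdateSoTimeout">600000</int>
  <!-- connection timeout for distributed updates, in milliseconds -->
  <int name="distribUpdateConnTimeout">60000</int>
</solrcloud>
```

Raising the timeout only hides long pauses rather than removing them, so it is a stopgap, not a fix for the underlying blocking.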

> On May 13, 2018, at 9:21 PM, Jay Potharaju <jspothar...@gmail.com> wrote:
> 
> Hi,
> I restarted both my Solr servers but I am seeing the async error again. In 
> older 5.x versions of SolrCloud, Solr would normally recover gracefully from 
> network errors, but Solr 6.6.3 does not seem to be doing that. At this 
> time only a small percentage of my operations are deleteByQuery; it is 
> mostly indexing of documents.
> I have not noticed any network blip like last time.  Any suggestions, or is 
> anyone else also having the same issue on Solr 6.6.3?
> 
>   I am again seeing the following two errors back to back. 
> 
>  ERROR org.apache.solr.update.StreamingSolrClients  
>  
> org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException:
>  Async exception during distributed update: Read timed out
> Thanks
> Jay 
>  
> 
> 
>> On Wed, May 9, 2018 at 12:34 AM Emir Arnautović 
>> <emir.arnauto...@sematext.com> wrote:
>> Hi Jay,
>> Network blip might be the cause, but also the consequence, of this issue. 
>> Maybe you can try avoiding DBQ while indexing and see if it is the cause. 
>> You can take a thread dump on “the other” node and see if there are blocked 
>> threads; that can give you more clues about what is going on.
>> 
>> Thanks,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> 
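Emir's thread-dump suggestion can be done with the JDK's jstack tool. A minimal sketch of filtering a saved dump for blocked threads; the sample dump written below is fabricated for illustration, and a real one would come from `jstack <solr-pid> > threads.txt`:

```shell
# Given a thread dump saved with:  jstack <solr-pid> > threads.txt
# list BLOCKED threads plus a line of context on each side. The sample
# dump below stands in for a real one so the pipeline is demonstrable.
cat > threads.txt <<'EOF'
"qtp1234-56" #78 prio=5 tid=0x0001 nid=0x2a waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.solr.update.DirectUpdateHandler2.deleteByQuery
"qtp1234-57" #79 prio=5 tid=0x0002 nid=0x2b runnable
   java.lang.Thread.State: RUNNABLE
EOF

# Show each BLOCKED thread with its name (line before) and the frame
# it is blocked in (line after).
grep -B 1 -A 1 'Thread.State: BLOCKED' threads.txt
```

Several threads blocked in the same update/merge code path at the moment of a timeout would support the DBQ-blocking explanation below.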
>> 
>> 
>> > On 8 May 2018, at 17:53, Jay Potharaju <jspothar...@gmail.com> wrote:
>> > 
>> > Hi Emir,
>> > I was seeing this error as long as the indexing was running. Once I stopped
>> > the indexing the errors also stopped.  Yes, we do monitor both hosts & solr
>> > but have not seen anything out of the ordinary except for a small network
>> > blip. In my experience solr generally recovers after a network blip and
>> > there are a few errors for streaming solr client...but have never seen this
>> > error before.
>> > 
>> > Thanks
>> > Jay
>> > 
>> > 
>> > 
>> > On Tue, May 8, 2018 at 12:56 AM, Emir Arnautović <
>> > emir.arnauto...@sematext.com> wrote:
>> > 
>> >> Hi Jay,
>> >> This is low ingestion rate. What is the size of your index? What is heap
>> >> size? I am guessing that this is not a huge index, so  I am leaning toward
>> >> what Shawn mentioned - some combination of DBQ/merge/commit/optimise that
>> >> is blocking indexing. Though, it is strange that it is happening only on
>> >> one node if you are sending updates randomly to both nodes. Do you monitor
>> >> your hosts/Solr? Do you see anything different at the time when timeouts
>> >> happen?
>> >> 
>> >> Thanks,
>> >> Emir
>> >> --
>> >> Monitoring - Log Management - Alerting - Anomaly Detection
>> >> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> >> 
>> >> 
>> >> 
>> >>> On 8 May 2018, at 03:23, Jay Potharaju <jspothar...@gmail.com> wrote:
>> >>> 
>> >>> I have about 3-5 updates per second.
>> >>> 
>> >>> 
>> >>>> On May 7, 2018, at 5:02 PM, Shawn Heisey <apa...@elyograg.org> wrote:
>> >>>> 
>> >>>>> On 5/7/2018 5:05 PM, Jay Potharaju wrote:
>> >>>>> There are some deletes by query. I have not had any issues with DBQ,
>> >>>>> currently have 5.3 running in production.
>> >>>> 
>> >>>> Here's the big problem with DBQ.  Imagine this sequence of events with
>> >>>> these timestamps:
>> >>>> 
>> >>>> 13:00:00: A commit for change visibility happens.
>> >>>> 13:00:00: A segment merge is triggered by the commit.
>> >>>> (It's a big merge that takes exactly 3 minutes.)
>> >>>> 13:00:05: A deleteByQuery is sent.
>> >>>> 13:00:15: An update to the index is sent.
>> >>>> 13:00:25: An update to the index is sent.
>> >>>> 13:00:35: An update to the index is sent.
>> >>>> 13:00:45: An update to the index is sent.
>> >>>> 13:00:55: An update to the index is sent.
>> >>>> 13:01:05: An update to the index is sent.
>> >>>> 13:01:15: An update to the index is sent.
>> >>>> 13:01:25: An update to the index is sent.
>> >>>> {time passes, more updates might be sent}
>> >>>> 13:03:00: The merge finishes.
>> >>>> 
>> >>>> Here's what would happen in this scenario:  The DBQ and all of the
>> >>>> update requests sent *after* the DBQ will block until the merge
>> >>>> finishes.  That means that it's going to take up to three minutes for
>> >>>> Solr to respond to those requests.  If the client that is sending the
>> >>>> request is configured with a 60 second socket timeout, which inter-node
>> >>>> requests made by Solr are by default, then it is going to experience a
>> >>>> timeout error.  The request will probably complete successfully once the
>> >>>> merge finishes, but the connection is gone, and the client has already
>> >>>> received an error.
>> >>>> 
>> >>>> Now imagine what happens if an optimize (forced merge of the entire
>> >>>> index) is requested on an index that's 50GB.  That optimize may take 2-3
>> >>>> hours, possibly longer.  A deleteByQuery started on that index after the
>> >>>> optimize begins (and any updates requested after the DBQ) will pause
>> >>>> until the optimize is done.  A pause of 2 hours or more is a BIG 
>> >>>> problem.
>> >>>> 
>> >>>> This is why deleteByQuery is not recommended.
>> >>>> 
>> >>>> If the deleteByQuery were changed into a two-step process involving a
>> >>>> query to retrieve ID values and then one or more deleteById requests,
>> >>>> then none of that blocking would occur.  The deleteById operation can
>> >>>> run at the same time as a segment merge, so neither it nor subsequent
>> >>>> update requests will have the significant pause.  From what I
>> >>>> understand, you can even do commits in this scenario and have changes be
>> >>>> visible before the merge completes.  I haven't verified that this is the
>> >>>> case.
>> >>>> 
>> >>>> Experienced devs: Can we fix this problem with DBQ?  On indexes with a
>> >>>> uniqueKey, can DBQ be changed to use the two-step process I mentioned?
>> >>>> 
>> >>>> Thanks,
>> >>>> Shawn
>> >>>> 
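The two-step alternative Shawn describes can be sketched against Solr's JSON query and update APIs. The endpoint, collection name, and page size below are assumptions, not values from this thread:

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint and collection; adjust to your cluster.
SOLR = "http://localhost:8983/solr/mycollection"

def fetch_matching_ids(query, rows=500):
    """Step 1: run the would-be DBQ as a plain query, fetching only ids."""
    url = f"{SOLR}/select?q={urllib.parse.quote(query)}&fl=id&rows={rows}&wt=json"
    with urllib.request.urlopen(url) as resp:
        return [d["id"] for d in json.load(resp)["response"]["docs"]]

def build_delete_by_id_payload(ids):
    """Step 2: a deleteById body; unlike DBQ this does not block on merges."""
    return json.dumps({"delete": list(ids)})

def delete_by_ids(ids):
    """Send the deleteById request to the JSON update handler."""
    req = urllib.request.Request(
        f"{SOLR}/update",
        data=build_delete_by_id_payload(ids).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

In practice the query/delete pair would be repeated until no ids remain, since new matching documents can arrive between the query and the deletes.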
>> >> 
>> >> 
>> 
