There is also a new "maxSize" setting for autoCommit that could perhaps be used in place of maxDocs to prevent the transaction log from growing too big; see https://solr.apache.org/guide/solr/latest/configuration-guide/commits-transaction-logs.html#automatic-commits
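A rough sketch of such a config (untested; the values here are placeholders, and the exact maxSize syntax is documented on the ref guide page above):

```xml
<!-- Illustrative only: cap the transaction log by size instead of doc count.
     Check the ref guide page above for the exact accepted size syntax. -->
<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:60000}</maxTime>
  <maxSize>100m</maxSize>
  <openSearcher>true</openSearcher>
</autoCommit>
```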
I have seen the index.NNNNNNNN directories a few times before with classic replication, piling up and causing disk space issues. But never in SolrCloud, and never getting stale like this, so no idea what is happening here. It is IndexFetcher that creates these .NNN dirs as a temp directory while pulling a new index without overwriting the existing one. So my guess is that something went wrong during index fetching, so that the temp .NNN folder was never swapped in as the new "index" folder but was left on disk. Watch out for exceptions or error logs. Worst case, some exception is swallowed somewhere?

Jan

> On 12 Jun 2023, at 10:47, Nick Vladiceanu <vladicean...@gmail.com> wrote:
>
> The reason autoCommit.maxTime is set to 5 mins is openSearcher=true, due to the TLOG + PULL replica types. Opening a new searcher too often is a costly operation, especially during high traffic. When we were using NRT, autoCommit.maxTime was set to something like 60s and softCommit to 5 mins. Since migrating to TLOG + PULL, the only way to open a new searcher and to "instruct" the PULL replicas to pull the segments periodically is via autoCommit.maxTime (if that hasn't changed, the PULL replicas will pull every autoCommit.maxTime / 2).
>
> We see some warnings regarding "checksum didn't match" on some nodes, for example:
>
>     File _lbu.cfe did not match. expected checksum is 1075403290 and actual is checksum 387703527. expected length is 542 and actual length is 542
>
> The GC activity for the past 7 days was uploaded from another node (from the same cluster):
> https://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMjMvMDYvMTIvc29scl9nYy50YXIuZ3otLTgtMzgtNDk=&channel=WEB
>
> The reload timeout is indeed weird; it never succeeds on some cores unless they get re-created. It would keep timing out at 180s no matter what. This is another issue since the Solr 9 upgrade that was reported at
> https://lists.apache.org/thread/vxklzw5z1qm9wdo4536mc7tcto7ov33x.
> Since we upgraded from 9.0 to 9.1.1, the timeouts happen much less often and the cluster doesn't get destabilized. We lowered the inter-node communication timeouts, and that helped to keep the cluster stable despite one node failing to reload.
>
> We are planning to start upgrading to 9.2.1; hopefully that will solve the issue.
>
> ---
> Nick Vladiceanu
> vladicean...@gmail.com
>
>> On 9 Jun 2023, at 16:33, Shawn Heisey <elyog...@elyograg.org> wrote:
>>
>> On 6/9/23 03:05, Nick Vladiceanu wrote:
>>> autoCommit is enabled and it also has openSearcher set to true, because with TLOG / PULL replicas there is no softCommit and therefore we need to open a new searcher during autoCommit.
>>>
>>> <autoCommit>
>>>   <maxTime>${solr.autoCommit.maxTime:300000}</maxTime>
>>>   <maxDocs>${solr.autoCommit.maxDocs:1000}</maxDocs>
>>>   <openSearcher>true</openSearcher>
>>> </autoCommit>
>>>
>>> When we tried to reload the collection, the node in question (node-7) timed out without any errors (general timeout, 180s).
>>
>> You should probably lower the maxTime on your autoCommit. Unless you have commits that regularly take longer than 30 seconds to complete ... and if that's the case, you might have a general performance issue and it might not be easy to solve. I would suggest a value of 60000 for maxTime.
>>
>> You should also remove maxDocs. Time-based autoCommit is a lot more predictable. With a super low value of 1000 docs, the autoCommit may be firing VERY frequently, and that can be problematic.
>>
>> With the relatively small size of your indexes, it seems very weird that a reload would take longer than 3 minutes. I would expect it to take at most a few seconds. I managed a sharded index (no SolrCloud) with much larger cores, and reloading all the cores would take only a few seconds total. We're back to the possibility of a general performance issue.
>>
>> I am hoping that all the fixes in 9.2.1 will help.
>> I did a quick glance through CHANGES.txt and nothing jumped out at me as a definite candidate for your troubles, but sometimes multiple fixes work together to fix a larger issue.
>>
>> I wish I had something more definite to tell you. Are you seeing any warnings or errors in solr.log?
>>
>> Thanks,
>> Shawn
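For anyone else hunting the leftover index.NNNNNNNN directories described above, a find over the data directory will surface them. This is only a sketch: the SOLR_HOME path and the one-day age threshold are assumptions, so adjust both for your install.

```shell
# Sketch: list leftover IndexFetcher temp dirs ("index.<timestamp>") older
# than a day, with their sizes. The SOLR_HOME default is an assumption.
SOLR_HOME="${SOLR_HOME:-/var/solr/data}"
find "$SOLR_HOME" -maxdepth 3 -type d -name 'index.[0-9]*' -mtime +1 \
  -exec du -sh {} \; 2>/dev/null
```

Note that the live "index" directory has no numeric suffix, so the `index.[0-9]*` pattern skips it and matches only the timestamped temp copies.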