Thanks, Shawn.

autoCommit is enabled, and it has openSearcher set to true: TLOG / PULL 
replicas have no softCommit, so we need to open a new searcher during 
autoCommit.

<autoCommit>
    <maxTime>${solr.autoCommit.maxTime:300000}</maxTime>
    <maxDocs>${solr.autoCommit.maxDocs:1000}</maxDocs>
    <openSearcher>true</openSearcher>
</autoCommit>

When we tried to reload the collection, the node in question (node-7) timed 
out without any errors (the general 180s timeout).

We have multiple clusters that run a similar setup (they differ in the number 
of nodes, the number of docs, and the size of the nodes), and none of them has 
ended up in such a weird state.

This is a bit worrying as in bigger clusters, without proper monitoring and 
alerting[*], one might end up serving outdated content.

We are planning to upgrade to 9.2.1 and actively monitor the state of the 
nodes.

[*] - we still need to figure out which metrics could tell us that the active 
index is lagging behind the leader. We have an idea, though:

    sum(rate(solr_metrics_core_searcher_documents{namespace!=""}[10m])) by (pod, collection)

This could give us at least some understanding of the index state on each 
node. Careful: it works only if the collections receive continuous updates. 
Does anyone have better ideas on how to monitor the lag of the active index?
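
One possible alternative that does not depend on continuous updates, as a 
minimal sketch only: ask each core's replication handler for its index version 
and compare the follower with the leader. This assumes the handler is reachable 
at /replication on every core, that command=indexversion returns "indexversion" 
and "generation", and that indexversion is a commit timestamp in milliseconds. 
The helper names, the core URLs in the usage comment, and the threshold are 
hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_index_version(core_url, timeout=10):
    """Return (indexversion, generation) from a core's replication handler."""
    url = f"{core_url}/replication?command=indexversion&wt=json"
    with urlopen(url, timeout=timeout) as resp:
        body = json.load(resp)
    return int(body["indexversion"]), int(body["generation"])


def version_lag_ms(leader, follower):
    """Leader indexversion minus follower indexversion, floored at 0.

    Assumption: indexversion is a commit timestamp in milliseconds, so the
    difference approximates how far the follower's commit point is behind.
    """
    return max(0, leader[0] - follower[0])


def is_lagging(leader, follower, max_lag_ms):
    """True if the follower is more than max_lag_ms behind the leader."""
    return version_lag_ms(leader, follower) > max_lag_ms


# Usage (hypothetical hosts and core names, hypothetical threshold):
# leader = fetch_index_version("http://node-1:8983/solr/<leader core>")
# follower = fetch_index_version("http://node-7:8983/solr/<follower core>")
# if is_lagging(leader, follower, max_lag_ms=2 * 300000):
#     print("follower is behind the leader")
```

A threshold of a couple of autoCommit intervals is only a starting point. The 
handler may also report 0 for both values (for example when it has no commit 
point to offer), so a 0 from the leader is better treated as "unknown" than as 
"in sync".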

Thanks,
---
Nick Vladiceanu
vladicean...@gmail.com 




> On 9. Jun 2023, at 10:34, Shawn Heisey <apa...@elyograg.org> wrote:
> 
> On 6/9/23 01:43, Nick Vladiceanu wrote:
>> We noticed that we get inconsistent results for the same query if run 
>> multiple times. Out of 4 requests, one of them was returning empty response 
>> when we were running “/select?q=id:12345&distrib=true”.
>> Started checking each core and we noticed that the core on node-7 had "Last 
>> Modified: 9 days ago” (Solr UI -> selected the core -> Overview). On the 
>> right side, "Instance details" were showing that we are using “Index: 
>> /var/solr/data/collection_0_shard2_replica_t15/data/index.20230530170400660”.
>>  Something is wrong.
> 
> I suspect that you may have turned off autoCommit.  If so, that's a bad idea.
> 
> The solrconfig.xml should always have autoCommit configured with a relatively 
> short maxTime.  The configs Solr ships with have maxTime set to 15000 
> milliseconds.  In most cases, the autoCommit should have openSearcher set to 
> false.  I personally increase the maxTime to 60000 milliseconds, so there is 
> less overall system load, but the 15 second interval in the example configs 
> works very well.  If it didn't, it wouldn't be in the example configs.
> 
> TLOG and PULL followers query the leader for changes on an interval that's 
> half of the autoCommit maxTime setting.
> 
> Note that autoCommit serves a very different purpose than autoSoftCommit.  If 
> you're going to disable one of them, it should be autoSoftCommit.
> 
> https://lucidworks.com/post/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
> 
> If you haven't disabled autoCommit then I don't know what might be wrong.
> 
> You should upgrade to 9.2.1.  The list of bugs fixed between 9.1 and 9.2 is 
> very extensive, and I have actually run into a number of them on 9.1.1.
> 
> I have never understood what makes Solr use an "index.NNNNNNNN" directory 
> instead of just "index" or when it switches to a new directory.  I know it 
> has something to do with replication, which is the Solr feature that 
> SolrCloud uses to copy TLOG/PULL replica data.
> 
> If you are finding that you've got extra data in some of your replicas from 
> multiple index directories, just reload the collection.  That should get 
> straightened out.
> 
> Thanks,
> Shawn
