Thanks, Shawn. autoCommit is enabled, and it also has openSearcher set to true: TLOG / PULL replicas have no softCommit, so we need to open a new searcher during autoCommit.

    <autoCommit>
      <maxTime>${solr.autoCommit.maxTime:300000}</maxTime>
      <maxDocs>${solr.autoCommit.maxDocs:1000}</maxDocs>
      <openSearcher>true</openSearcher>
    </autoCommit>

When we tried to reload the collection, the node in question (node-7) timed out without any errors (general timeout, 180s).

We have multiple clusters that run a similar setup (the differences are the number of nodes, the number of docs and the size of the nodes), and none of them ended up in such a weird state. This is a bit worrying because in bigger clusters, without proper monitoring and alerting[*], one might end up serving outdated content.

We are planning to upgrade to 9.2.1 and actively monitor the state of the nodes.

[*] - we still need to figure out which metrics could tell us that the active index is lagging behind the leader. We have an idea, though:

    sum(rate(solr_metrics_core_searcher_documents{namespace!=""}[10m])) by (pod, collection)

This could give us at least some understanding of the index state on each node. Careful: it works only if you have continuous updates to the collections.

Does anyone have better ideas on how to monitor the lag of the active index?

Thanks,
---
Nick Vladiceanu
vladicean...@gmail.com

> On 9. Jun 2023, at 10:34, Shawn Heisey <apa...@elyograg.org> wrote:
>
> On 6/9/23 01:43, Nick Vladiceanu wrote:
>> We noticed that we get inconsistent results for the same query if run
>> multiple times. Out of 4 requests, one of them was returning empty response
>> when we were running "/select?q=id:12345&distrib=true".
>> Started checking each core and we noticed that the core on node-7 had "Last
>> Modified: 9 days ago" (Solr UI -> selected the core -> Overview). On the
>> right side, "Instance details" were showing that we are using "Index:
>> /var/solr/data/collection_0_shard2_replica_t15/data/index.20230530170400660".
>> Something is wrong.
>
> I suspect that you may have turned off autoCommit. If so, that's a bad idea.
>
> The solrconfig.xml should always have autoCommit configured with a relatively
> short maxTime. The configs Solr ships with have maxTime set to 15000
> milliseconds. In most cases, the autoCommit should have openSearcher set to
> false. I personally increase the maxTime to 60000 milliseconds, so there is
> less overall system load, but the 15 second interval in the example configs
> works very well. If it didn't, it wouldn't be in the example configs.
>
> TLOG and PULL followers query the leader for changes on an interval that's
> half of the autoCommit maxTime setting.
>
> Note that autoCommit serves a very different purpose than autoSoftCommit. If
> you're going to disable one of them, it should be autoSoftCommit.
>
> https://lucidworks.com/post/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>
> If you haven't disabled autoCommit then I don't know what might be wrong.
>
> You should upgrade to 9.2.1. The list of bugs fixed between 9.1 and 9.2 is
> very extensive, and I have actually run into a number of them on 9.1.1.
>
> I have never understood what makes Solr use an "index.NNNNNNNN" directory
> instead of just "index" or when it switches to a new directory. I know it
> has something to do with replication, which is the Solr feature that
> SolrCloud uses to copy TLOG/PULL replica data.
>
> If you are finding that you've got extra data in some of your replicas from
> multiple index directories, just reload the collection. That should get
> straightened out.
>
> Thanks,
> Shawn
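
On the open question of how to monitor the lag of the active index: one possible alternative to the rate-based metric, which should not depend on continuous updates, is to compare the index generation reported by each replica with the one reported by its shard leader. Below is a minimal sketch, not a tested monitoring solution. It assumes that each core's replication handler answers "command=details" with a "details" object containing a "generation" field; please verify that against your Solr version. The host names, the leader core name and the threshold are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def fetch_replication_details(base_url, core, timeout=10.0):
    """Fetch the 'details' object from a core's replication handler.

    Assumption: the handler is reachable at /solr/<core>/replication and
    returns JSON with a top-level "details" key.
    """
    query = urllib.parse.urlencode({"command": "details", "wt": "json"})
    url = f"{base_url.rstrip('/')}/solr/{core}/replication?{query}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)["details"]


def generation_lag(leader_details, replica_details):
    """Number of index generations the replica is behind the leader."""
    return int(leader_details["generation"]) - int(replica_details["generation"])


def is_stale(leader_details, replica_details, max_lag=1):
    """True if the replica is more than max_lag generations behind.

    max_lag=1 is an arbitrary default: a follower is normally allowed to be
    slightly behind between two polls of the leader.
    """
    return generation_lag(leader_details, replica_details) > max_lag


# Example usage (hypothetical hosts and leader core name):
#   leader = fetch_replication_details("http://node-1:8983",
#                                      "collection_0_shard2_replica_t1")
#   replica = fetch_replication_details("http://node-7:8983",
#                                       "collection_0_shard2_replica_t15")
#   if is_stale(leader, replica):
#       print("node-7 is behind by", generation_lag(leader, replica))
```

A check like this could run from a cron job or an exporter and alert when is_stale() stays true for longer than a couple of polling intervals. It would flag a replica stuck on an old index directory even when the collection receives no updates for a while.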