Hi everyone, we have run into a very strange situation with our Solr 9.1.1 cluster, which runs on the official Docker image on Kubernetes.
Context:
- 3 collections across 8 nodes; all replicas are of type TLOG; directory factory: MMapDirectoryFactory.
- 2 collections (collection_1 and collection_2) have a single shard each; they are legacy, not actively updated, and very small (<20MB), so we can skip them.
- 1 collection (collection_0) has two shards, each with 4 replicas. Each node hosts only one replica (core) of collection_0 plus one replica of each of the other two collections, so a node hosts at most 3 cores. Total index size for collection_0 is ~6-7GB; since it is split into two shards, each shard holds ~3GB of index.

We noticed that the same query returns inconsistent results when run multiple times: out of 4 requests to "/select?q=id:12345&distrib=true", one returned an empty response. We started checking each core and noticed that the core on node-7 showed "Last Modified: 9 days ago" (Solr UI -> selected the core -> Overview). On the right side, "Instance details" showed "Index: /var/solr/data/collection_0_shard2_replica_t15/data/index.20230530170400660". Something was wrong.

We ran "kubectl exec" and started inspecting the "/var/solr/data" folder. All the commands executed and their results are posted here: https://justpaste.it/47jp4. What we saw is that there were two index directories and two transaction logs. The old index "index.20230530170400660" was last updated on May 30 at 17:04; the new index "index.20230530113934434" was being updated constantly, but it only ever grew, reaching ~50GB. "index.properties" pointed to the old index "index.20230530170400660", whereas "replication.properties" pointed to the new index "index.20230530113934434".
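The index.properties mismatch above could have been spotted mechanically. Here is a rough sketch (Python, written for this post, not something Solr ships; the only Solr-specific assumption is that index.properties in the core's data dir contains an "index=<dirname>" line, which is exactly what we saw on node-7):

```python
#!/usr/bin/env python3
"""Detect a stale index.properties pointer in a Solr core data directory.

Sketch only: assumes index.properties holds an 'index=<dirname>' entry,
as Solr writes it when it switches to a new index directory.
"""
import os
import re


def active_index_dir(data_dir):
    """Return the index directory named in index.properties, or the
    default 'index' if the properties file is absent."""
    props = os.path.join(data_dir, "index.properties")
    if not os.path.exists(props):
        return "index"
    with open(props) as f:
        for line in f:
            m = re.match(r"\s*index\s*=\s*(\S+)", line)
            if m:
                return m.group(1)
    return "index"


def stale_pointer(data_dir):
    """Return (active, newest) when index.properties points at a directory
    other than the most recently modified index.* directory, else None."""
    candidates = [d for d in os.listdir(data_dir)
                  if (d == "index" or d.startswith("index."))
                  and os.path.isdir(os.path.join(data_dir, d))]
    if not candidates:
        return None
    newest = max(candidates,
                 key=lambda d: os.path.getmtime(os.path.join(data_dir, d)))
    active = active_index_dir(data_dir)
    return (active, newest) if active != newest else None
```

Run against /var/solr/data/collection_0_shard2_replica_t15/data, this should report the old directory as active and the constantly updated one as newest.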
solr@node-7:/var/solr/data/collection_0_shard2_replica_t15/data$ du -h index.20230530113934434
56G     index.20230530113934434
solr@node-7:/var/solr/data/collection_0_shard2_replica_t15/data$ du -h index.20230530170400660
3.1G    index.20230530170400660

Every time a new searcher was opened, it used the old, outdated index, hence the inconsistent results. We couldn't find any errors in the node's logs, only some warnings about segment checksums not matching and the segment pull being retried.

Could anyone help with ideas on how to debug this further to find the root cause, and how to better monitor for and prevent it in the future? What could cause a new index directory to be created, and why wasn't index.properties switched to point at the new index? Are there any metrics that could tell us the time since the last replication from the leader for the active index?

Thanks in advance.

Best regards,
---
Nick Vladiceanu
vladicean...@gmail.com
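P.S. One monitoring idea we are considering while waiting for answers: poll the metrics API on every node, e.g. "/solr/admin/metrics?group=core&prefix=INDEX.sizeInBytes", and alert when replicas of the same shard diverge in reported index size. A rough sketch of the comparison, assuming the standard core-metrics JSON shape where registry keys look like "solr.core.<collection>.<shard>.<replica>" (the sample names below are from our cluster; the size check alone would not necessarily have caught our exact case, since the 56GB directory was not the one the searcher had open, but it is a cheap cross-replica sanity check):

```python
"""Flag shard replicas whose reported index size diverges from siblings.

Sketch: `metrics` is the parsed "metrics" object from
  /solr/admin/metrics?group=core&prefix=INDEX.sizeInBytes
collected from every node and merged into one dict.
"""
from collections import defaultdict


def divergent_replicas(metrics, ratio=2.0):
    """Group INDEX.sizeInBytes by (collection, shard) and return the shards
    where the largest replica exceeds `ratio` times the smallest."""
    by_shard = defaultdict(dict)
    for registry, values in metrics.items():
        # Registry key format: solr.core.<collection>.<shard>.<replica>
        parts = registry.split(".")
        if len(parts) != 5 or parts[:2] != ["solr", "core"]:
            continue
        size = values.get("INDEX.sizeInBytes")
        if size:
            by_shard[(parts[2], parts[3])][parts[4]] = size
    return {shard: sizes for shard, sizes in by_shard.items()
            if max(sizes.values()) > ratio * min(sizes.values())}
```

For staleness specifically, comparing "generation" and "indexVersion" from "/solr/<core>/replication?command=details" across replicas of the same shard would be a more direct signal, if anyone can confirm that is reliable for TLOG replicas.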