Hi everyone,

we have run into a very strange situation with Solr 9.1.1, running the official
Docker image on Kubernetes.

Context:
- 3 collections, 8 nodes in total;
- all replica types are TLOG; directory factory: MMapDirectoryFactory;
- 2 collections (collection_1 and collection_2) have a single shard each; they
  are tiny (<20MB), legacy, and not actively updated, so we can skip them;
- 1 collection (collection_0) has two shards, each with 4 replicas;
- each node hosts only one replica (core) of collection_0 plus one replica of
  each of the other two collections, so at most 3 cores per node;
- collection_0 is ~6-7GB in total; since it is split into two shards, each
  shard's index is ~3GB.

We noticed that the same query returns inconsistent results when run multiple
times. Out of 4 requests, one returned an empty response when we ran
"/select?q=id:12345&distrib=true".
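To pinpoint which replica was serving stale results, it helps to bypass the
distributed query and hit each core directly with distrib=false, comparing
numFound per core. A rough sketch of that check (the core URLs in the example
are hypothetical; substitute your own nodes):

```python
import json
from urllib.request import urlopen

def all_consistent(counts):
    """True when every replica reports the same numFound for the query."""
    return len(set(counts.values())) <= 1

def count_per_replica(doc_id, replica_urls):
    """Query each core directly (distrib=false) and collect its numFound."""
    counts = {}
    for url in replica_urls:
        with urlopen(f"{url}/select?q=id:{doc_id}&distrib=false&rows=0") as resp:
            counts[url] = json.load(resp)["response"]["numFound"]
    return counts

# Example usage (hypothetical core URLs):
# counts = count_per_replica("12345", [
#     "http://node-5:8983/solr/collection_0_shard2_replica_t13",
#     "http://node-7:8983/solr/collection_0_shard2_replica_t15",
# ])
# print(counts, all_consistent(counts))
```

A core that disagrees with the rest is the one to inspect.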

We started checking each core and noticed that the core on node-7 showed "Last
Modified: 9 days ago" (Solr UI -> selected the core -> Overview). On the right
side, "Instance details" showed that it was using "Index:
/var/solr/data/collection_0_shard2_replica_t15/data/index.20230530170400660".
Something was wrong.

We ran "kubectl exec" into the pod and started checking the "/var/solr/data"
folder. All the commands we executed and their results are posted here:
https://justpaste.it/47jp4.

What we saw there was two index directories and two transaction logs. The old
index "index.20230530170400660" had its last update on May 30 at 17:04; the new
index "index.20230530113934434" was constantly updated, but it only kept
growing, reaching ~50GB. "index.properties" pointed to the old index
"index.20230530170400660", whereas "replication.properties" pointed to the new
index "index.20230530113934434".

solr@node-7:/var/solr/data/collection_0_shard2_replica_t15/data$ du -h index.20230530113934434
56G index.20230530113934434

solr@node-7:/var/solr/data/collection_0_shard2_replica_t15/data$ du -h index.20230530170400660
3.1G index.20230530170400660

Every time a new searcher was opened, it used the old, outdated index; hence
the inconsistent results. We couldn't find any errors in the node's logs, only
warnings about segment checksums not matching and the segments being pulled
again.

Could anyone please help with ideas on how to debug this further to find the
root cause, and how to better monitor/prevent it in the future? What could
cause a new index directory to be created, and why wasn't the core switched
over to use the new index? Are there any metrics that could tell us the time
since the last replication from the leader for the active index?
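One monitoring idea we are considering (a sketch only; the hostnames are
hypothetical and the endpoint behavior should be verified on your version): the
replication handler's "command=indexversion" returns a core's current
generation, so periodically comparing leader vs. follower generations should
flag a follower that stops applying pulls. The metrics API
(/solr/admin/metrics?group=core) also exposes INDEX.sizeInBytes per core, which
would have caught the runaway 56GB directory.

```python
import json
from urllib.request import urlopen

def generation(core_url):
    """Ask a core's replication handler for its current index generation."""
    with urlopen(f"{core_url}/replication?command=indexversion&wt=json") as resp:
        return json.load(resp)["generation"]

def follower_behind(leader_gen, follower_gen, max_lag=0):
    """True if the follower's generation trails the leader's by more than max_lag."""
    return leader_gen - follower_gen > max_lag

# Example usage (hypothetical cores):
# lg = generation("http://node-3:8983/solr/collection_0_shard2_replica_t11")
# fg = generation("http://node-7:8983/solr/collection_0_shard2_replica_t15")
# print(lg, fg, follower_behind(lg, fg))
```

Alerting when a follower's generation stalls while the leader's keeps advancing
would have surfaced this 9-day-old core much sooner.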

Thanks in advance.

Best regards,
---
Nick Vladiceanu
vladicean...@gmail.com 



