Hi Ramesh,

> I heard that they will be deleted on the replica node only when a leader change occurs

While I am no Solr expert, this understanding is incorrect AFAIK. The replication handler internally sets up a listener for switching transaction logs on the non-leader TLOG replicas. Unless you have changed the related settings, the poll interval is usually 150s. If you want to confirm it, you can turn on the relevant logs and search for a log line similar to the one below during startup:

o.a.s.h.ReplicationHandler Poll scheduled at an interval of 150000ms

If you are interested in diving deep, please look at ReplicationHandler.setupPolling and ReplicateFromLeader.startReplication in the source code.
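If you'd rather not eyeball the log by hand, a small sketch like the one below can pull the interval out of a startup log. The sample line is fabricated from the pattern above; your real log path (e.g. server/logs/solr.log) and interval will differ:

```python
import re

# Hypothetical sample startup log content; replace with the contents of
# your actual solr.log. Only the "Poll scheduled" phrase is from Solr.
sample_log = (
    "2024-02-08 06:00:01.123 INFO  (main) [ ] o.a.s.h.ReplicationHandler "
    "Poll scheduled at an interval of 150000ms\n"
)

def poll_interval_seconds(log_text):
    """Return the replication poll interval in seconds, or None if absent."""
    m = re.search(r"Poll scheduled at an interval of (\d+)ms", log_text)
    return int(m.group(1)) / 1000 if m else None

print(poll_interval_seconds(sample_log))  # 150.0
```

If this prints nothing close to what you expect, the replica is likely not a polling (non-leader TLOG/PULL) replica, or the relevant logger is not enabled.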
We are also investigating a similar issue to the one you reported above, which we experienced in our setup last week. We landed upon this Jira ticket: https://issues.apache.org/jira/browse/SOLR-14679, which is on similar lines. However, we are seeing subtle differences, due to which we are investigating further to see if there is more than one root cause for such behavior.

I would suggest toggling the log levels of the below loggers on your TLOG replicas at least. You should be able to toggle the log levels via the Admin UI, to avoid a restart, in case you didn't set <Configuration monitorInterval="30" status="WARN"> in your log4j2.xml when setting up Solr.

<AsyncLogger name="org.apache.solr.handler.IndexFetcher" level="debug"/>
<AsyncLogger name="org.apache.solr.handler.ReplicationHandler" level="debug"/>
<AsyncLogger name="org.apache.solr.cloud.OverseerTaskProcessor" level="info"/>
<AsyncLogger name="org.apache.solr.common.cloud.ZkStateReader" level="info"/>
<AsyncLogger name="org.apache.solr.update.UpdateLog" level="debug"/>
<AsyncLogger name="org.apache.solr.update.TransactionLog" level="debug"/>
<AsyncLogger name="org.apache.solr.cloud.ReplicateFromLeader" level="info"/>

We didn't have all of the above loggers turned on at the stated levels at the time of the incident.
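For what it's worth, the Admin UI's logging screen drives Solr's Logging API, so the same toggle can be scripted. A minimal sketch that builds the request URL, assuming a node at localhost:8983 (host and logger choice here are placeholders, and the actual HTTP call is left commented since it needs a running node):

```python
# Sketch: construct a Solr Logging API call to raise a logger's level
# without restarting the node. Host/logger values are assumptions.
SOLR_HOST = "http://localhost:8983"
LOGGER = "org.apache.solr.handler.IndexFetcher"
LEVEL = "DEBUG"

# The admin logging endpoint accepts set=<logger>:<level>
url = f"{SOLR_HOST}/solr/admin/info/logging?set={LOGGER}:{LEVEL}&wt=json"
print(url)

# To actually apply it against a live node, something like:
# import urllib.request
# urllib.request.urlopen(url).read()
```

Remember to set the levels back afterwards; leaving IndexFetcher at DEBUG on a busy replica can produce a lot of log volume.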
However, with whatever we had, below is the snippet of the logs pertaining to it:

# Logs till when the replication worked properly
2024-02-08 06:09:35.217 INFO (indexFetcher-20-thread-1) [ ] o.a.s.h.IndexFetcher Master's generation: 442774
2024-02-08 06:09:35.217 INFO (indexFetcher-20-thread-1) [ ] o.a.s.h.IndexFetcher Master's version: 1707372496922
2024-02-08 06:09:35.217 INFO (indexFetcher-20-thread-1) [ ] o.a.s.h.IndexFetcher Slave's generation: 442773
2024-02-08 06:09:35.217 INFO (indexFetcher-20-thread-1) [ ] o.a.s.h.IndexFetcher Slave's version: 1707372196789
2024-02-08 06:09:35.217 INFO (indexFetcher-20-thread-1) [ ] o.a.s.h.IndexFetcher Starting replication process
2024-02-08 06:09:35.223 INFO (indexFetcher-20-thread-1) [ ] o.a.s.h.IndexFetcher Number of files in latest index in master: 327
2024-02-08 06:09:35.227 INFO (indexFetcher-20-thread-1) [ ] o.a.s.h.IndexFetcher Starting download (fullCopy=false) to NRTCachingDirectory(MMapDirectory@/<my-path>/<tlog-dir>/data/index.20240208060935223 lockFactory=org.apache.lucene.store.NativeFSLockFactory@7c5f7fe8; maxCacheMB=48.0 maxMergeSizeMB=4.0)
2024-02-08 06:09:35.227 INFO (indexFetcher-20-thread-1) [ ] o.a.s.h.IndexFetcher tmpIndexDir_type : class org.apache.lucene.store.NRTCachingDirectory , MMapDirectory@<my-path>/<tlog-dir>/data/index.20240208060935223 lockFactory=org.apache.lucene.store.NativeFSLockFactory@7c5f7fe8
2024-02-08 06:09:35.778 INFO (indexFetcher-20-thread-1) [ ] o.a.s.h.IndexFetcher Bytes downloaded: 341525269, Bytes skipped downloading: 0
2024-02-08 06:09:35.778 INFO (indexFetcher-20-thread-1) [ ] o.a.s.h.IndexFetcher Total time taken for download (fullCopy=false ,bytesDownloaded=341525269) : 0 secs (null bytes/sec) to NRTCachingDirectory(MMapDirectory@/<my-path>/<tlog-dir>/data/index.20240208060935223 lockFactory=org.apache.lucene.store.NativeFSLockFactory@7c5f7fe8; maxCacheMB=48.0 maxMergeSizeMB=4.0)
...
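As an aside, one rough way to scan IndexFetcher logs for a fetch cycle that reported the file count but never started downloading (the symptom in the next snippet) is a sketch like this; the helper is hypothetical and not part of Solr, only the log phrases are taken from these snippets:

```python
import re

# Flag "Number of files in latest index" lines that were never followed
# by a "Starting download" line -- i.e. fetch cycles that stalled.
NUM_FILES = re.compile(r"Number of files in latest index in master: (\d+)")
DOWNLOAD = re.compile(r"Starting download")

def find_stalled_cycles(lines):
    """Return indexes of lines that opened a fetch cycle with no download."""
    stalled = []
    pending = None  # index of a file-count line awaiting its download line
    for i, line in enumerate(lines):
        if NUM_FILES.search(line):
            if pending is not None:
                stalled.append(pending)
            pending = i
        elif DOWNLOAD.search(line) and pending is not None:
            pending = None
    if pending is not None:
        stalled.append(pending)  # the last cycle never reached download
    return stalled

healthy = [
    "... o.a.s.h.IndexFetcher Number of files in latest index in master: 327",
    "... o.a.s.h.IndexFetcher Starting download (fullCopy=false) ...",
]
broken = healthy + [
    "... o.a.s.h.IndexFetcher Number of files in latest index in master: 313",
]
print(find_stalled_cycles(healthy))  # []
print(find_stalled_cycles(broken))   # [2]
```

Running something like this periodically over the replica's log could serve as a crude alert until the root cause is found.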
# Logs from where we start to see the issue
2024-02-08 06:14:35.216 INFO (indexFetcher-20-thread-1) [ ] o.a.s.h.IndexFetcher Master's generation: 442775
2024-02-08 06:14:35.216 INFO (indexFetcher-20-thread-1) [ ] o.a.s.h.IndexFetcher Master's version: 1707372797037
2024-02-08 06:14:35.216 INFO (indexFetcher-20-thread-1) [ ] o.a.s.h.IndexFetcher Slave's generation: 442774
2024-02-08 06:14:35.216 INFO (indexFetcher-20-thread-1) [ ] o.a.s.h.IndexFetcher Slave's version: 1707372496922
2024-02-08 06:14:35.216 INFO (indexFetcher-20-thread-1) [ ] o.a.s.h.IndexFetcher Starting replication process
2024-02-08 06:14:35.222 INFO (indexFetcher-20-thread-1) [ ] o.a.s.h.IndexFetcher Number of files in latest index in master: 313

After the above, the indexFetcher-20-thread-1 thread completely disappeared from our logs, indicating that no new index fetch activity happened. Even the last index fetch didn't complete fully: "Number of files in latest index in master" was the last line it printed, and it never continued with the download activity. I may be wrong here, but my hunch is one of the following two possibilities:

1. The thread faced an exception and thus got killed, and there is no code which monitors that "some index fetcher thread" is always there.
2. The thread ended up blocked on some lock which it failed to acquire, or made an early exit because of it, and kept doing so (i.e. the thread was still alive, but since it wasn't reaching this code, we didn't see any info-level logs).

We no longer have a running process exhibiting this issue, so for now we are working our way backwards from logs and graphs. If you could share some details around your issue, and if the observations overlap, it may help us reach a definitive conclusion.

For short-term monitoring, you may want to plot the below metrics, watch them closely, and take avoiding action to ensure the issue doesn't impact your service:

1. disk space: this graph would shoot up steadily as soon as the Solr process went into the above state.
2. old-gen space and eden space: old gen would shoot up and eden space would drop drastically, making an "X" shape on the graph for us.

All of the above is my understanding so far. It may be better if someone from the Solr core contributors team can throw some light here or provide guidance/hints on how to proceed.

On Fri, Feb 9, 2024 at 9:10 PM Gummadi, Ramesh <rameshchandra.gumm...@ironmountain.com.invalid> wrote:

> TLOGs on TLOG replicas are growing continuously. That is the issue. They
> are growing 10x the index size.
>
> On Fri, Feb 9, 2024 at 4:40 PM 6harat <bharat.gulati.ce...@gmail.com> wrote:
>
> > Hi Ramesh,
> >
> > By "replication tlogs are piling up" did you mean that you are seeing
> > that the tlog file under "<solr-data>/<core-name>/data/tlog/" directory
> > is becoming substantially larger (10x) than your actual index size?
> >
> > On Fri, Feb 9, 2024 at 3:48 PM Gummadi, Ramesh
> > <rameshchandra.gumm...@ironmountain.com.invalid> wrote:
> >
> > > I have a SolrCloud setup with 4 shards, 2 replicas each. Both the
> > > leader and replica are TLOG replicas (2 TLOG replicas per shard). In
> > > this setup the replication tlogs are piling up on the non-leader node.
> > > I heard that they will be deleted on the replica node only when a
> > > leader change occurs. What is the solution to this issue?
> > > In the TLOG non-leader replica, will the documents be replicated
> > > through peer sync when new documents are indexed on the leader?
> > >
> > > [image: image.png]
> > > --
> > > Ramesh Gummadi
> > >
> > > The information contained in this email message and its attachments is
> > > intended only for the private and confidential use of the recipient(s)
> > > named above, unless the sender expressly agrees otherwise. Transmission
> > > of email over the Internet is not a secure communications medium.
> > > If you are requesting or have requested the transmittal of personal
> > > data, as defined in applicable privacy laws, by means of email or in an
> > > attachment to email, you must select a more secure alternate means of
> > > transmittal that supports your obligations to protect such personal
> > > data.
> > >
> > > If the reader of this message is not the intended recipient and/or you
> > > have received this email in error, you must take no action based on the
> > > information in this email and you are hereby notified that any
> > > dissemination, misuse, copying or disclosure of this communication is
> > > strictly prohibited. If you have received this communication in error,
> > > please notify us immediately by email and delete the original message.
> >
> > --
> > Regards
> > 6harat

--
Regards
6harat