Hi Ramesh,

>I heard that they will be deleted on the replica node only when a leader
change occurs
While I am no Solr expert, this understanding is incorrect AFAIK. The
replication handler internally sets up polling (along with a listener for
switching transaction logs) on the non-leader TLOG replicas. Unless you have
changed the related settings, the poll interval is usually 150s. If you want
to confirm it, you can turn on the relevant logs and search for a line
similar to the one below during startup.
o.a.s.h.ReplicationHandler Poll scheduled at an interval of 150000ms
If you are interested in diving deep, please look at
ReplicationHandler.setupPolling and ReplicateFromLeader.startReplication in
the source code.
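To make the check concrete, here is a minimal sketch of grepping the startup log for that line. The log path, timestamp, and sample line below are illustrative only, not from any real setup; point the grep at your actual solr.log instead.

```shell
# Illustrative only: the path and sample line are assumptions; grep your
# real solr.log on a non-leader TLOG replica instead.
LOG=/tmp/solr_startup_sample.log
cat > "$LOG" <<'EOF'
2024-02-08 05:58:01.102 INFO  (main) [   ] o.a.s.h.ReplicationHandler Poll scheduled at an interval of 150000ms
EOF
# 150000ms matches the 150s interval mentioned above
grep "Poll scheduled at an interval" "$LOG"
```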

We are also investigating an issue similar to the one you reported above,
which we experienced in our setup last week. We landed on this Jira ticket:
https://issues.apache.org/jira/browse/SOLR-14679, which is along similar
lines. However, we are seeing subtle differences, so we are investigating
further to determine whether there is more than one root cause for this
behavior.

I would suggest toggling the log levels of at least the loggers below on
your TLOG replicas. You should be able to toggle the log levels via the
Admin UI, which avoids a restart in case you didn't include
<Configuration monitorInterval="30" status="WARN"> in your log4j2.xml when
setting up Solr.
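If clicking through the Admin UI for each logger is tedious, the same runtime toggling can be done over HTTP via Solr's logging endpoint. A sketch, assuming Solr is reachable on localhost:8983 (adjust host/port for your setup); like the UI, these changes last only until the next restart:

```shell
# Assumption: Solr at localhost:8983. Sets two of the loggers from the
# list below to DEBUG at runtime, no restart needed.
curl "http://localhost:8983/solr/admin/info/logging?set=org.apache.solr.handler.IndexFetcher:DEBUG"
curl "http://localhost:8983/solr/admin/info/logging?set=org.apache.solr.handler.ReplicationHandler:DEBUG"
```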

<AsyncLogger name="org.apache.solr.handler.IndexFetcher" level="debug"/>
<AsyncLogger name="org.apache.solr.handler.ReplicationHandler" level="debug"/>
<AsyncLogger name="org.apache.solr.cloud.OverseerTaskProcessor" level="info"/>
<AsyncLogger name="org.apache.solr.common.cloud.ZkStateReader" level="info"/>
<AsyncLogger name="org.apache.solr.update.UpdateLog" level="debug"/>
<AsyncLogger name="org.apache.solr.update.TransactionLog" level="debug"/>
<AsyncLogger name="org.apache.solr.cloud.ReplicateFromLeader" level="info"/>

We didn't have all of the above loggers turned on at the stated levels at
the time of the incident. However, below is a snippet of the pertinent logs
from what we did have:

# Logs till when the replication worked properly
2024-02-08 06:09:35.217 INFO  (indexFetcher-20-thread-1) [   ] o.a.s.h.IndexFetcher Master's generation: 442774
2024-02-08 06:09:35.217 INFO  (indexFetcher-20-thread-1) [   ] o.a.s.h.IndexFetcher Master's version: 1707372496922
2024-02-08 06:09:35.217 INFO  (indexFetcher-20-thread-1) [   ] o.a.s.h.IndexFetcher Slave's generation: 442773
2024-02-08 06:09:35.217 INFO  (indexFetcher-20-thread-1) [   ] o.a.s.h.IndexFetcher Slave's version: 1707372196789
2024-02-08 06:09:35.217 INFO  (indexFetcher-20-thread-1) [   ] o.a.s.h.IndexFetcher Starting replication process
2024-02-08 06:09:35.223 INFO  (indexFetcher-20-thread-1) [   ] o.a.s.h.IndexFetcher Number of files in latest index in master: 327
2024-02-08 06:09:35.227 INFO  (indexFetcher-20-thread-1) [   ] o.a.s.h.IndexFetcher Starting download (fullCopy=false) to NRTCachingDirectory(MMapDirectory@/<my-path>/<tlog-dir>/data/index.20240208060935223 lockFactory=org.apache.lucene.store.NativeFSLockFactory@7c5f7fe8; maxCacheMB=48.0 maxMergeSizeMB=4.0)
2024-02-08 06:09:35.227 INFO  (indexFetcher-20-thread-1) [   ] o.a.s.h.IndexFetcher tmpIndexDir_type  : class org.apache.lucene.store.NRTCachingDirectory , MMapDirectory@<my-path>/<tlog-dir>/data/index.20240208060935223 lockFactory=org.apache.lucene.store.NativeFSLockFactory@7c5f7fe8
2024-02-08 06:09:35.778 INFO  (indexFetcher-20-thread-1) [   ] o.a.s.h.IndexFetcher Bytes downloaded: 341525269, Bytes skipped downloading: 0
2024-02-08 06:09:35.778 INFO  (indexFetcher-20-thread-1) [   ] o.a.s.h.IndexFetcher Total time taken for download (fullCopy=false ,bytesDownloaded=341525269) : 0 secs (null bytes/sec) to NRTCachingDirectory(MMapDirectory@/<my-path>/<tlog-dir>/data/index.20240208060935223 lockFactory=org.apache.lucene.store.NativeFSLockFactory@7c5f7fe8; maxCacheMB=48.0 maxMergeSizeMB=4.0)
...
# Logs from where we start to see the issue
2024-02-08 06:14:35.216 INFO  (indexFetcher-20-thread-1) [   ] o.a.s.h.IndexFetcher Master's generation: 442775
2024-02-08 06:14:35.216 INFO  (indexFetcher-20-thread-1) [   ] o.a.s.h.IndexFetcher Master's version: 1707372797037
2024-02-08 06:14:35.216 INFO  (indexFetcher-20-thread-1) [   ] o.a.s.h.IndexFetcher Slave's generation: 442774
2024-02-08 06:14:35.216 INFO  (indexFetcher-20-thread-1) [   ] o.a.s.h.IndexFetcher Slave's version: 1707372496922
2024-02-08 06:14:35.216 INFO  (indexFetcher-20-thread-1) [   ] o.a.s.h.IndexFetcher Starting replication process
2024-02-08 06:14:35.222 INFO  (indexFetcher-20-thread-1) [   ] o.a.s.h.IndexFetcher Number of files in latest index in master: 313

In the above, the indexFetcher-20-thread-1 thread disappeared completely
from our logs, indicating that no new index-fetch activity happened. Even
its last fetch did not complete: "Number of files in latest index in master"
was the last line it printed, and it never proceeded to the download phase.

I may be wrong here, but my hunch points to one of the following two
possibilities:
1. The thread hit an exception and was killed, and there is no code that
monitors that "some index fetcher thread" is always running.
2. The thread ended up blocked on a lock it could not acquire, or kept
making an early exit because of it (i.e. the thread was still alive, but
since it never reached this code, we saw no INFO-level logs).
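If anyone does catch a process in this state, a thread dump should distinguish the two cases: in case 1 the indexFetcher thread would be absent entirely, while in case 2 its stack would show it parked on a monitor or lock. A sketch (the pgrep pattern is an assumption; adjust it to however your Solr is launched):

```shell
# Find the Solr JVM pid; the "start.jar" pattern is an assumption.
SOLR_PID=$(pgrep -f "start.jar" | head -1)
# Dump all threads and keep only the index fetcher's stack.
# No match at all would support hypothesis 1 (thread died);
# a stack waiting on a monitor/lock would support hypothesis 2.
jstack "$SOLR_PID" | grep -A 20 "indexFetcher"
```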

We no longer have a running process exhibiting this issue, so for now we
are working backwards from logs and graphs.

If you could share some details about your issue, and the observations
overlap, it may help us reach a definitive conclusion.

For short-term monitoring, you may want to plot the metrics below, watch
them closely, and take avoiding action to ensure the issue doesn't impact
your service.
1. Disk space: this graph shot up steadily as soon as the Solr process
entered the above state.
2. Old-gen and Eden space: old gen shot up while Eden dropped drastically,
drawing an "X" on the graph for us.
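As a crude stand-in until proper dashboards are in place, the disk-space side can be watched with a small helper; a sketch, where the tlog path is a placeholder you must substitute with your core's actual data directory:

```shell
# Report size and file count for a core's tlog directory.
# The example path is an assumption; substitute your own, e.g. the
# <solr-data>/<core-name>/data/tlog layout mentioned earlier in the thread.
tlog_usage() {
  dir="$1"
  du -sh "$dir"          # total size on disk
  ls -1 "$dir" | wc -l   # number of tlog files
}
# Example invocation (placeholder path):
# tlog_usage /var/solr/data/<core-name>/data/tlog
```

A steadily growing size and file count on a non-leader replica is exactly the symptom described in this thread.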


All of the above is my understanding so far. It would be better if someone
from the Solr core contributors could shed some light here or provide
guidance/hints on how to proceed.


On Fri, Feb 9, 2024 at 9:10 PM Gummadi, Ramesh
<rameshchandra.gumm...@ironmountain.com.invalid> wrote:

> TLOGs on TLOG replicas are growing continuously. That is the issue. They
> are growing 10x the index size.
>
> On Fri, Feb 9, 2024 at 4:40 PM 6harat <bharat.gulati.ce...@gmail.com>
> wrote:
>
> > Hi Ramesh,
> >
> > By  "replication tlogs are piling up" did you mean that you are seeing
> that
> > the tlog file under "<solr-data>/<core-name>/data/tlog/" directory
> becoming
> > substantially larger (10x) than your actual index size?
> >
> >
> > On Fri, Feb 9, 2024 at 3:48 PM Gummadi, Ramesh
> > <rameshchandra.gumm...@ironmountain.com.invalid> wrote:
> >
> > > I have a SolrCloud set up with 4 shards 2 replicas each. Both the
> leader
> > > and replica are in TLOG replication ( 2 TLOG replicas per shard). In
> this
> > > set up the replication tlogs are piling up on the non leader node. I
> > heard
> > > that they will be deleted on the replica node only when a leader change
> > > occurs. What is the solution to this issue.
> > > In the TLOG non leader replica, will the documents be replicated
> through
> > > peer sync when new documents are indexed on the leader.
> > >
> > > [image: image.png]
> > > --
> > > Ramesh Gummadi
> > >
> > >
> > > The information contained in this email message and its attachments is
> > > intended only for the private and confidential use of the recipient(s)
> > > named above, unless the sender expressly agrees otherwise. Transmission
> > of
> > > email over the Internet is not a secure communications medium. If you
> are
> > > requesting or have requested the transmittal of personal data, as
> defined
> > > in applicable privacy laws by means of email or in an attachment to
> > email,
> > > you must select a more secure alternate means of transmittal that
> > supports
> > > your obligations to protect such personal data.
> > >
> > > If the reader of this message is not the intended recipient and/or you
> > > have received this email in error, you must take no action based on the
> > > information in this email and you are hereby notified that any
> > > dissemination, misuse or copying or disclosure of this communication is
> > > strictly prohibited. If you have received this communication in error,
> > > please notify us immediately by email and delete the original message.
> > >
> >
> >
> > --
> > Regards
> > 6harat
> >
>


-- 
Regards
6harat
