> Are yours growing always, on all nodes, forever? Or is it one or two who
ends up in a bad state?
Randomly, on some of the shards and on some of the followers in the collection.
Then whichever tlog was open on the follower while it was still the leader,
that one never stops growing. And that shard had active
I've run into this (or a similar) issue in the past (Solr 6? I don't
remember exactly) where tlogs get stuck, either growing indefinitely
and/or refusing to commit on restart.
What I ended up doing was writing a monitor to check the number of
tlogs and alert if they got over some limit (100 or wh
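A monitor like the one described above can be sketched as a small helper that counts the files in a core's tlog directory and compares against a threshold. The directory layout and the limit of 100 are assumptions; adjust them for your install.

```python
# Hypothetical tlog monitor: count transaction-log files under a core's
# data directory and flag when they exceed a threshold.
from pathlib import Path

TLOG_LIMIT = 100  # alert threshold suggested in the thread; tune as needed

def count_tlogs(core_data_dir: str) -> int:
    """Count tlog files (named tlog.<N>) under <core_data_dir>/tlog."""
    tlog_dir = Path(core_data_dir) / "tlog"
    if not tlog_dir.is_dir():
        return 0
    return sum(1 for f in tlog_dir.iterdir() if f.name.startswith("tlog."))

def over_limit(core_data_dir: str, limit: int = TLOG_LIMIT) -> bool:
    """True if the core has accumulated more tlog files than the limit."""
    return count_tlogs(core_data_dir) > limit
```

Wiring this into cron or your alerting system is left out; the point is just that an ever-growing file count is cheap to detect.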
Looks like the problem is related to tlog rotation on the follower shard.
We did the following for a specific shard.
0. start solr cloud
1. solr-0 (leader), solr-1, solr-2
2. rebalance to make solr-1 as preferred leader
3. solr-0, solr-1 (leader), solr-2
The tlog file on solr-0 kept on growing i
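For reference, step 2 above (making solr-1 the preferred leader) is typically done with the Collections API's ADDREPLICAPROP and REBALANCELEADERS actions. A minimal sketch that just builds the request URLs, with base_url, collection, shard, and replica names as placeholders:

```python
# Sketch of the preferred-leader rebalance via the Collections API.
# The base URL and names are placeholders for your deployment.
import urllib.parse

def preferred_leader_url(base_url: str, collection: str,
                         shard: str, replica: str) -> str:
    """ADDREPLICAPROP request marking one replica as preferredLeader."""
    params = urllib.parse.urlencode({
        "action": "ADDREPLICAPROP",
        "collection": collection,
        "shard": shard,
        "replica": replica,
        "property": "preferredLeader",
        "property.value": "true",
    })
    return f"{base_url}/admin/collections?{params}"

def rebalance_url(base_url: str, collection: str) -> str:
    """REBALANCELEADERS request that moves leadership to preferred leaders."""
    params = urllib.parse.urlencode({
        "action": "REBALANCELEADERS",
        "collection": collection,
    })
    return f"{base_url}/admin/collections?{params}"
```

Issuing the first URL and then the second reproduces the leadership move from solr-0 to solr-1 described above.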
We found that for the shard that does not get a leader, the tlog replay did
not complete for hours (we don't see the "log replay finished", "creating
leader registration node", "I am the new leader", etc. log messages).
We're also not sure why the tlogs are tens of GBs (anywhere from 30 to 40 GB).
Collectio
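A quick way to confirm whether replay completed is to scan the node's log for the messages quoted above. A minimal sketch (the marker strings mirror the phrases mentioned in this thread):

```python
# Hypothetical check for replay/leader-election completion in a Solr log.
REPLAY_MARKERS = (
    "log replay finished",
    "creating leader registration node",
    "I am the new leader",
)

def replay_completed(log_lines) -> bool:
    """True if any of the expected completion messages appear in the log."""
    return any(marker in line
               for line in log_lines
               for marker in REPLAY_MARKERS)
```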
By tracing the output in the log files we see the following sequence.
Overseer role list has POD-1, POD-2, POD-3 in that order
POD-3 has 2 shard leaders.
POD-3 restarts.
A) Logs for the shard whose leader moves successfully from POD-3 to POD-1
On POD-1: o.a.s.c.ShardLeaderElectionContext Replay
I haven't delved into the exact reason for this, but what generally helps
to avoid this situation in a cluster is:
i) During shutdown (in case you need to restart the cluster), let the
overseer node be the last one to shut down.
ii) While restarting, let the overseer node be the first one to start
i
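The two ordering rules above can be sketched as trivial helpers: given the node list and the current overseer (which you can find via the Collections API's OVERSEERSTATUS action), put the overseer last for shutdown and first for startup. Node names here are placeholders.

```python
# Sketch of restart ordering rules (i) and (ii): overseer shuts down last
# and starts first. Assumes the overseer node id is already known.
def shutdown_order(nodes: list, overseer: str) -> list:
    """All other nodes first, overseer last."""
    return [n for n in nodes if n != overseer] + [overseer]

def startup_order(nodes: list, overseer: str) -> list:
    """Overseer first, then all other nodes."""
    return [overseer] + [n for n in nodes if n != overseer]
```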
Hello,
On reboot of one of the solr nodes in the cluster, we often see a
collection's shards with
1. LEADER replica in DOWN state, and/or
2. shard with no LEADER
Output from /solr/admin/collections?action=CLUSTERSTATUS is below.
Even after 5 to 10 minutes, the collection often does not recover.
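For anyone wanting to automate the check, the two failure conditions above can be detected by walking the CLUSTERSTATUS response. A sketch, assuming the standard cluster/collections/&lt;name&gt;/shards/&lt;shard&gt;/replicas JSON layout:

```python
# Hypothetical helper that flags shards with no leader, or a leader whose
# replica state is not "active", from a parsed CLUSTERSTATUS response.
def find_bad_shards(status: dict, collection: str) -> list:
    """Return (shard, reason) pairs for unhealthy shards."""
    bad = []
    shards = status["cluster"]["collections"][collection]["shards"]
    for shard_name, shard in shards.items():
        leaders = [r for r in shard["replicas"].values()
                   if r.get("leader") == "true"]
        if not leaders:
            bad.append((shard_name, "no leader"))
        elif any(r["state"] != "active" for r in leaders):
            bad.append((shard_name, "leader not active"))
    return bad
```

Running this against the CLUSTERSTATUS output after a node reboot would surface exactly the DOWN-leader and leaderless-shard cases described.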