Gaël:

Many thanks for your writeup. I preserved your and Carlos’ comments in a JIRA: 
https://issues.apache.org/jira/browse/SOLR-14679.

How fast you bring the nodes up and down shouldn’t really matter, but if 
pausing between bouncing nodes when doing a rolling upgrade keeps you from 
having operational problems, then it’s the lesser of two evils.
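
If it helps, the pause between bounces doesn’t have to be a fixed sleep: you can
wait until every replica of the collection is active again and its node is live
before restarting the next one. A rough SolrJ sketch of that check (the ZooKeeper
address and collection name are placeholders, not anything from your setup):

import java.util.Collections;
import java.util.Optional;
import java.util.concurrent.TimeUnit;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.Replica;

public class WaitForHealthyCollection {
  public static void main(String[] args) throws Exception {
    // Placeholder ZooKeeper address and collection name -- adjust to your cluster.
    try (CloudSolrClient client = new CloudSolrClient.Builder(
        Collections.singletonList("zk1:2181"), Optional.empty()).build()) {
      client.connect();
      while (!allReplicasActiveAndLive(client, "col_blue")) {
        TimeUnit.SECONDS.sleep(5); // pause between checks
      }
      System.out.println("All replicas active and live; safe to bounce the next node.");
    }
  }

  static boolean allReplicasActiveAndLive(CloudSolrClient client, String collection)
      throws Exception {
    // A replica counts as healthy only if it is ACTIVE and its node is in live_nodes.
    ClusterState state = client.getClusterStateProvider().getClusterState();
    DocCollection coll = state.getCollection(collection);
    for (Replica replica : coll.getReplicas()) {
      boolean live = state.getLiveNodes().contains(replica.getNodeName());
      if (!live || replica.getState() != Replica.State.ACTIVE) {
        return false;
      }
    }
    return true;
  }
}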

Carlos:

Killing Solr forcefully can be problematic; it makes it much more likely that 
you’ll replay tlogs and trigger this problem. Mind you, this is a real problem 
and being patient on shutdown is a band-aid… If it fits for you operationally 
and you can

1> stop ingesting
2> ensure a hard commit happens before shutdown (see the sketch below)

then you should avoid a lot of this and shutdown should be very quick. We 
changed bin/solr to wait up to 180 seconds rather than 10 because “kill -9” was 
causing problems. I realize it’s not always possible to control. It’s kind of a 
pay-me-now-or-pay-me-later situation: the time saved killing Solr may be more 
than used up by startup later. IIRC, failed leader elections were also more 
frequent with SIGKILL.
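
The pre-shutdown commit in 2> can be as simple as the sketch below (SolrJ, with 
a placeholder URL and collection name, so adjust to your setup). Once ingestion 
is stopped and this call has returned, bin/solr stop should finish well inside 
its wait window (IIRC that window is the SOLR_STOP_WAIT setting in solr.in.sh).

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class CommitBeforeShutdown {
  public static void main(String[] args) throws Exception {
    // Placeholder base URL -- point it at the node you are about to stop.
    try (SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
      // Explicit hard commit: waitFlush=true, waitSearcher=true, so the call
      // only returns once the segments are flushed and durable on disk.
      client.commit("col_blue", true, true);
    }
  }
}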

Best,
Erick

> On Jul 24, 2020, at 5:52 AM, Gael Jourdan-Weil 
> <gael.jourdan-w...@kelkoogroup.com> wrote:
> 
> I think I've come down to the root cause of this mess in our case.
> 
> Everything is confirming that the TLOG state is "BUFFERING" rather than 
> "ACTIVE".
> 1/ This can be seen with the metrics API as well (see the sketch after this 
> list), where we observe:
> "TLOG.replay.remaining.bytes":48997506,
> "TLOG.replay.remaining.logs":1,
> "TLOG.state":1,
> 2/ When a hard commit occurs, we can see it in the logs and the index files 
> are updated; but we can also see that the postCommit and preCommit UpdateLog 
> methods are called and exit immediately, which, looking at the code, indicates 
> the state is "BUFFERING".
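> 
> For reference, those TLOG.* values in 1/ come from the metrics API; a query 
> like the following SolrJ sketch (the host name is a placeholder) returns them 
> per core:
> 
> import org.apache.solr.client.solrj.SolrClient;
> import org.apache.solr.client.solrj.SolrRequest;
> import org.apache.solr.client.solrj.impl.HttpSolrClient;
> import org.apache.solr.client.solrj.request.GenericSolrRequest;
> import org.apache.solr.common.params.ModifiableSolrParams;
> import org.apache.solr.common.util.NamedList;
> 
> public class CheckTlogState {
>   public static void main(String[] args) throws Exception {
>     // Placeholder host -- point it at the node hosting the suspect core.
>     try (SolrClient client = new HttpSolrClient.Builder("http://srv1:8983/solr").build()) {
>       ModifiableSolrParams params = new ModifiableSolrParams();
>       params.set("group", "core");   // core-level metric registries
>       params.set("prefix", "TLOG");  // only the TLOG.* metrics
>       NamedList<Object> resp = client.request(
>           new GenericSolrRequest(SolrRequest.METHOD.GET, "/admin/metrics", params));
>       // Look for TLOG.state and TLOG.replay.remaining.* for each core.
>       System.out.println(resp);
>     }
>   }
> }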
> 
> So, why is this TLOG still in "BUFFERING" state?
> 
> From the code, the only place where state is set to "BUFFERING" seems to be 
> UpdateLog.bufferUpdates.
> From the logs, in our case it comes from the recovery process. We see the 
> message "Begin buffering updates. core=[col_blue_shard1]".
> Just after, we can see "Publishing state of core [col_blue_shard1] as 
> recovering, leader is [http://srv2/solr/col_blue_shard1/] and I am 
> [http://srv1/solr/col_blue_shard1/]".
> 
> Up to here, everything is expected I guess, but why is the TLOG state not set 
> to "ACTIVE" a bit later?
> 
> Well, the "Begin buffering updates" occurred and 500ms later we can see:
> - "Updated live nodes from ZooKeeper... (2) -> (1)" (I think at this time we 
> shut down srv2; this is our main cause of the problem)
> - "I am going to be the leader srv1"
> - "Stopping recovery for core=[col_blue_shard1] coreNodeName=[core_node1]"
> And 2s later:
> - "Attempting to PeerSync from [http://srv2/solr/es_blue_shard1/] - 
> recoveringAfterStartup=[true]"
> - "Error while trying to recover. 
> core=es_blue_shard1:org.apache.solr.common.SolrException: Failed to get 
> fingerprint from leader"
> - "Finished recovery process, successful=[false]"
> 
> At this point, I think the root cause on our side is a rolling update that we 
> did too quickly: we stopped node2 while node1 was still recovering from it.
> 
> It's still not clear how everything went back to the "active" state after such 
> a failed recovery with a TLOG still in "BUFFERING".
> 
> We shouldn't have been in recovery in the first place, and I think we know 
> why; this is the first thing that we have addressed.
> Then we need to add some pauses in our rolling update strategy.
> 
> Does it make sense? Can you think of something else to check/improve?
> 
> Best Regards,
> Gaël
