I think I've come down to the root cause of this mess in our case.

Everything is confirming that the TLOG state is "BUFFERING" rather than 
"ACTIVE".
1/ This can be seen with the metrics API as well where we observe:
"TLOG.replay.remaining.bytes":48997506,
"TLOG.replay.remaining.logs":1,
"TLOG.state":1,
2/ When a hard commit occurs, we can see it in the logs and as the index files 
are updated ; but we can also see that postCommit and preCommit UpdateLog 
methods are called but exits immediately which looking at the code indicates 
the state is "BUFFERING".

So, why is this TLOG still in "BUFFERING" state?

>From the code, the only place where state is set to "BUFFERING" seems to be 
>UpdateLog.bufferUpdates.
>From the logs, in our case it comes from recovery process. We see the message 
>"Begin buffering updates. core=[col_blue_shard1]".
Just after we can see "Publishing state of core [col_blue_shard1] as 
recovering, leader is [http://srv2/solr/col_blue_shard1/] and I am 
[http://srv1/solr/col_blue_shard1/]";.

Until here, everything is expected I guess but why the TLOG state is not set to 
"ACTIVE" a bit later?

Well, the "Begin buffering updates" occured and 500ms later we can see:
- "Updated live nodes from ZooKeeper... (2) -> (1)" (I think at this time we 
shut down srv2, this is our main cause of problem)
- "I am going to be the leader srv1"
- "Stopping recovery for core=[col_blue_shard1] coreNodeName=[core_node1]"
And 2s later:
- "Attempting to PeerSync from [http://srv2/solr/es_blue_shard1/] - 
recoveringAfterStartup=[true]"
- "Error while trying to recover. 
core=es_blue_shard1:org.apache.solr.common.SolrException: Failed to get 
fingerprint from leader"
- "Finished recovery process, successful=[false]"

At this point, I think the root cause on our side is a rolling update that we 
did too quickly: we stopped node2 while node1 while recovering from it.

It's still not clear how everything went back to "active" state after such a 
failed recovery and a TLOG still in "BUFFERING".

We shouldn't have been in recovery in the first place and I think we know why, 
this is a first thing that we have adressed.
Then we need to add some pauses in our rolling update strategy.

Does it makes sense? Can you think of something else to check/improve?

Best Regards,
Gaƫl

Reply via email to