Data loss for actions happening after a truncate in 3.4.3

Christian Ziech Fri, 15 Jun 2012 09:38:58 -0700

This issue seems to affect only ZooKeeper 3.4.3 (and not 3.3.5). Basically, it seems that after the truncate method is invoked, the logStream member of the FileTxnLog still points to the old position in the file, where it would have written the next entry before the truncate happened. Since the log file is not rolled over and the stream is not reset, the next write creates a gap in the file, and that gap is interpreted as the end of the file when the log is read.
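To illustrate the mechanism in isolation, here is a minimal sketch in Python (not ZooKeeper code). It assumes a made-up log format of fixed-size 8-byte entries in which an all-zero entry marks the end of the log; the names write_after_truncate and read_entries are hypothetical. One handle keeps appending while a second handle truncates the file, so the next write lands at the stale offset and leaves a zero-filled gap.

```python
import os
import tempfile

ENTRY = b"ENTRY---"          # hypothetical fixed-size (8 byte) log entry
ENTRY_SIZE = len(ENTRY)


def write_after_truncate(total=4, keep=2):
    """Write `total` entries, truncate to `keep` entries through a second
    handle, then write one more entry through the original handle."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        writer = open(path, "wb", buffering=0)   # plays the role of the stale stream
        for _ in range(total):
            writer.write(ENTRY)
        # Truncate via a different handle; the writer's offset is NOT moved.
        with open(path, "r+b") as truncator:
            truncator.truncate(keep * ENTRY_SIZE)
        # This write lands at the old offset, past the new end of the file.
        writer.write(ENTRY)
        writer.close()
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)


def read_entries(data):
    """Read entries until an all-zero entry, which this toy format
    (an assumption of the sketch) treats as the end of the log."""
    entries = []
    for off in range(0, len(data), ENTRY_SIZE):
        chunk = data[off:off + ENTRY_SIZE]
        if chunk == b"\x00" * ENTRY_SIZE:
            break
        entries.append(chunk)
    return entries


if __name__ == "__main__":
    data = write_after_truncate()
    print(len(data), len(read_entries(data)))
```

In this sketch the entry written after the truncate is physically present in the file, but a reader that stops at the first zero-filled entry never reaches it, which matches the behaviour described above.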

That means that once this node becomes leader later on, it sends a snapshot to all its peers that contains only the entries up to the truncation; all entries written after it are not sent. We had this happen on a test cluster on 2 of 3 ZooKeeper servers while the network connection was bad. Even after the nodes recovered, we would lose all the data every time the leadership switched to one of those two nodes.

Furthermore (and this is something I could not yet reproduce 100%), it seems that in some situations, after some leader changes, the transaction log file not only contains a gap but simply stops after the last entry before the truncation.

I have a small program that is able to reproduce the error reliably for 3.4.3 but not for 3.3.5. That seems to be related to the new leader in 3.3.5 not sending the truncation message to the peer that was more advanced than the new leader, but the actual problem also seems to be present in 3.3.5 (I just couldn't get the TRUNC message to be sent in my test).


Have other people encountered the same issue already?

I will create a ticket with the test that reproduces the issue later, but first I need to spend some more time on that script. Things are a little hard to reproduce because I have to pull a ZooKeeper server out of the ensemble for some time without restarting it; to do so, I'm using port forwarding instead of direct connections, which I can interrupt even on localhost.


What more information do you guys need to investigate the issue?
