Created https://issues.apache.org/jira/browse/ZOOKEEPER-1489 - please
let me know if the test does not fail as intended (it did for me) or
otherwise does not show the problem correctly (or if you need anything else).
On 15.06.2012 19:51, ext Patrick Hunt wrote:
Please do create a jira for this. A reproducible test case, or even
just steps to reproduce, would be useful. Sounds like something we'll
need to get into 3.4.4 at the very least.
Thanks for the report!
Patrick
On Fri, Jun 15, 2012 at 9:38 AM, Christian Ziech
<[email protected]> wrote:
This issue seems to affect only ZooKeeper 3.4.3 (and not 3.3.5). Basically,
it seems that after the truncate method is invoked, the logStream member of
FileTxnLog still points to the old position in the file, where it would have
written the next entry before the truncate happened. Since the log file is
not rolled over and the stream is not reset, a gap is created in the file,
and when the log is read that gap is interpreted as the end of the file.
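To illustrate the mechanism described above, here is a minimal stand-alone
sketch. It is not ZooKeeper's FileTxnLog code: the class name
TruncateGapDemo, the fixed 8-byte records and the "an all-zero record marks
the end" rule are assumptions made only for this illustration. It keeps one
output stream open, truncates the file through a second handle, and then
writes again through the stale stream.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

public class TruncateGapDemo {

    // Writes one 8-byte record at the stream's current position.
    static void writeRecord(FileChannel ch, long value) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(Long.BYTES);
        buf.putLong(value);
        buf.flip();
        while (buf.hasRemaining()) {
            ch.write(buf);
        }
    }

    // Counts records until EOF or the first all-zero record.
    // Treating a zero record as the end of the log is an assumption of this sketch.
    static int countReadable(Path file) throws IOException {
        int n = 0;
        try (DataInputStream in = new DataInputStream(Files.newInputStream(file))) {
            while (true) {
                long v = in.readLong();
                if (v == 0L) {
                    break;
                }
                n++;
            }
        } catch (EOFException eof) {
            // reached the physical end of the file
        }
        return n;
    }

    // Returns { final file size in bytes, number of records a reader sees }.
    static long[] demo(Path file) throws IOException {
        try (FileOutputStream logStream = new FileOutputStream(file.toFile())) {
            FileChannel ch = logStream.getChannel();
            writeRecord(ch, 1L);
            writeRecord(ch, 2L);
            writeRecord(ch, 3L); // stream position is now 24

            // Truncate to one record through a separate handle;
            // logStream is neither reset nor reopened.
            try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw")) {
                raf.setLength(Long.BYTES);
            }

            // The stale stream still writes at its old position, leaving a gap.
            writeRecord(ch, 4L);
        }
        return new long[] { Files.size(file), countReadable(file) };
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("txnlog-sketch", ".bin");
        long[] result = demo(file);
        System.out.println("file size = " + result[0]
                + ", readable records = " + result[1]);
        Files.delete(file);
    }
}
```

On a typical file system the record written after the truncation lands
beyond the new end of the file, the bytes in between read back as zeros,
and the reader stops at the gap without ever reaching the last record.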
That means that once this node becomes leader later on, it sends a snapshot
to all its peers that only contains entries up to the truncation; all
entries after it are not sent. We had this happen on 2 of 3 ZooKeeper
servers in a test cluster while the network connection was bad. Even after
the nodes recovered, we would lose all the data every time the leadership
switched to one of those two nodes.
Furthermore (and this is something I have not been able to reproduce
reliably yet), there seem to be situations in which, after some leader
changes, the transaction log file does not just contain a gap but simply
stops after the last entry before the truncation.
I have a small program that reproduces the error reliably on 3.4.3 but not
on 3.3.5. That seems to be because the new leader in 3.3.5 does not send the
truncation message to a peer that is more advanced than the new leader, but
the actual problem also seems to be present in 3.3.5 (I just couldn't get
the TRUNC message to be sent in my test).
Have other people encountered the same issue already?
I will create a ticket with the test that reproduces the issue later, but
first I need to spend some more time on that script. Things are a little
hard to reproduce because I have to pull a ZooKeeper server out of the
ensemble for some time without restarting it; to do so I'm using port
forwarding, which I can interrupt even on localhost, instead of direct
connections.
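For readers who want to try a similar setup, the following is a rough sketch
of an interruptible TCP port forwarder. It is not the script mentioned
above; the class name InterruptibleForwarder and its methods are made up for
this illustration, and it assumes Java 9 or later (for
InputStream.transferTo). The idea is that peers connect to the forwarder's
port instead of the server's real port, so the link can be cut and restored
without restarting the server.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class InterruptibleForwarder implements AutoCloseable {
    private final ServerSocket listener;
    private final String targetHost;
    private final int targetPort;
    private final Set<Socket> open = ConcurrentHashMap.newKeySet();
    private volatile boolean blocked = false;

    // listenPort 0 lets the OS pick a free port; see port().
    public InterruptibleForwarder(int listenPort, String targetHost, int targetPort)
            throws IOException {
        this.listener = new ServerSocket(listenPort, 50, InetAddress.getLoopbackAddress());
        this.targetHost = targetHost;
        this.targetPort = targetPort;
        Thread acceptor = new Thread(this::acceptLoop, "forwarder-accept");
        acceptor.setDaemon(true);
        acceptor.start();
    }

    public int port() {
        return listener.getLocalPort();
    }

    // Cuts all current connections and refuses new ones until restore().
    public void interrupt() {
        blocked = true;
        for (Socket s : open) {
            closeQuietly(s);
        }
    }

    public void restore() {
        blocked = false;
    }

    @Override
    public void close() throws IOException {
        interrupt();
        listener.close();
    }

    private void acceptLoop() {
        while (!listener.isClosed()) {
            Socket client = null;
            try {
                client = listener.accept();
                if (blocked) {
                    closeQuietly(client);
                    continue;
                }
                Socket server = new Socket(targetHost, targetPort);
                open.add(client);
                open.add(server);
                if (blocked) { // interrupt() raced with this connection
                    closeQuietly(client);
                    closeQuietly(server);
                }
                pump(client, server);
                pump(server, client);
            } catch (IOException e) {
                // listener closed or target unreachable: drop this client
                if (client != null) {
                    closeQuietly(client);
                }
            }
        }
    }

    // Copies bytes in one direction until either side closes.
    private void pump(Socket from, Socket to) {
        Thread t = new Thread(() -> {
            try {
                from.getInputStream().transferTo(to.getOutputStream());
            } catch (IOException ignored) {
                // connection cut, by the peer or by interrupt()
            } finally {
                closeQuietly(from);
                closeQuietly(to);
                open.remove(from);
                open.remove(to);
            }
        });
        t.setDaemon(true);
        t.start();
    }

    private static void closeQuietly(Socket s) {
        try {
            s.close();
        } catch (IOException ignored) {
            // nothing useful to do here
        }
    }
}
```

One forwarder instance would be needed per port that has to be cut, and the
other ensemble members would have to be configured with the forwarder's
ports rather than the server's own.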
What more information do you guys need to investigate the issue?
--
*NOKIA*
*Christian Ziech*