Inserts are distributed to already failed nodes in the group
------------------------------------------------------------
Key: SEQUOIA-980
URL: https://forge.continuent.org/jira/browse/SEQUOIA-980
Project: Sequoia
Type: Bug
Versions: Sequoia 2.10.10, Sequoia 2.10.9, Sequoia 2.10.8, Sequoia 2.10.7,
Sequoia 2.10.6, Sequoia 2.10.5
Environment: Sequoia 2.10.5 and 2.10.9 using Appia as group communication
protocol
Reporter: Stefan Lischke
Priority: Blocker
Setup: two synced controllers with one backend each.
I start a loop that inserts a row into the VDB every second through controller
2, roughly like the sketch below.
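For reference, the insert loop looks roughly like the following minimal sketch.
The driver class, JDBC URL form, virtual database name, port, credentials and
table are assumptions/placeholders, not the exact test code.
----
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class InsertLoop {
    public static void main(String[] args) throws Exception {
        // Load the Sequoia JDBC driver and connect through controller 2 only
        // (URL, VDB name, credentials and port are placeholders).
        Class.forName("org.continuent.sequoia.driver.Driver");
        Connection conn = DriverManager.getConnection(
                "jdbc:sequoia://192.168.0.114:25322/myVDB", "user", "password");
        PreparedStatement stmt = conn.prepareStatement(
                "INSERT INTO test_table (id, payload) VALUES (?, ?)");
        for (int i = 0; ; i++) {
            stmt.setInt(1, i);
            stmt.setString(2, "row " + i);
            stmt.executeUpdate();   // the write should be distributed to all backends
            Thread.sleep(1000);     // one insert per second
        }
    }
}
----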
While both backends are filling up, I manually cut the communication between
the two controllers (tested by removing the network cable / stopping networking
/ firewalling / routing).
Both controllers notice that the other one has failed in the group and,
according to the log, each only sees itself. Controller 2 keeps accepting
writes and fills up its VDB.
Everything is fine up to this point. The problem is described below.
controller 1
----
2007-09-19 17:47:53,039 INFO continuent.hedera.gms
Member(address=/192.168.0.114:21081, uid=192.168.0.114:21081) failed in
Group(gid=messenger)
2007-09-19 17:47:53,041 WARN controller.virtualdatabase.messenger Controller
Member(address=/192.168.0.114:21081, uid=192.168.0.114:21081) has left the
cluster.
2007-09-19 17:47:53,042 INFO controller.virtualdatabase.messenger 0 requests
were waiting responses from Member(address=/192.168.0.114:21081,
uid=192.168.0.114:21081)
2007-09-19 17:47:53,045 INFO controller.requestmanager.cleanup Waiting 60000ms
for client of controller 562949953421312 to failover
2007-09-19 17:48:09,553 INFO controller.virtualdatabase.messenger
Member(address=/192.168.0.102:21081, uid=192.168.0.102:21081) see
members:[Member(address=/192.168.0.102:21081, uid=192.168.0.102:21081)] and has
mapping:{Member(address=/192.168.0.102:21081,
uid=192.168.0.102:21081)=192.168.0.102:21090}
2007-09-19 17:48:53,049 INFO controller.requestmanager.cleanup Cleanup for
controller 562949953421312 failure is completed.
----
controller 2
----
2007-09-19 17:47:50,258 INFO continuent.hedera.gms
Member(address=/192.168.0.102:21081, uid=192.168.0.102:21081) failed in
Group(gid=messenger)
2007-09-19 17:47:50,262 WARN controller.virtualdatabase.messenger Controller
Member(address=/192.168.0.102:21081, uid=192.168.0.102:21081) has left the
cluster.
2007-09-19 17:47:50,263 INFO controller.virtualdatabase.messenger 1 requests
were waiting responses from Member(address=/192.168.0.102:21081,
uid=192.168.0.102:21081)
2007-09-19 17:47:50,263 WARN controller.RequestManager.messenger 1
controller(s) died during execution of request 562949953421464
2007-09-19 17:47:50,263 WARN controller.RequestManager.messenger Controller
Member(address=/192.168.0.102:21081, uid=192.168.0.102:21081) is suspected of
failure.
2007-09-19 17:47:50,270 INFO controller.requestmanager.cleanup Waiting
60000ms for client of controller 0 to failover
2007-09-19 17:48:50,275 INFO controller.requestmanager.cleanup Cleanup for
controller 0 failure is completed.
----
The problem: when the connection between controller 1 and controller 2 becomes
available again (tested by plugging the network cable back in / starting
networking / firewalling / routing), controller 2 distributes all new inserts
(those made after the controllers see each other again) to controller 1 again.
controller 1 sees controller 2:
2007-09-19 17:50:09,578 INFO controller.virtualdatabase.messenger
Member(address=/192.168.0.102:21081, uid=192.168.0.102:21081) see
members:[Member(address=/192.168.0.102:21081, uid=192.168.0.102:21081),
Member(address=/192.168.0.114:21081, uid=192.168.0.114:21081)] and has
mapping:{Member(address=/192.168.0.102:21081,
uid=192.168.0.102:21081)=192.168.0.102:21090}
controller 2 sees controller 1:
2007-09-19 17:49:00,719 INFO controller.virtualdatabase.messenger
Member(address=/192.168.0.114:21081, uid=192.168.0.114:21081) see
members:[Member(address=/192.168.0.102:21081, uid=192.168.0.102:21081),
Member(address=/192.168.0.114:21081, uid=192.168.0.114:21081)] and has
mapping:{Member(address=/192.168.0.114:21081,
uid=192.168.0.114:21081)=192.168.0.114:21090}
That is really ugly, because the database of controller 1 now has fewer entries
than controller 2: all inserts made while controller 1 was disconnected were
written only to the backend of controller 2.
Two possible solutions (see the sketch after this list):
* If Sequoia detects a failed controller, no subsequent inserts should be
distributed to the once-failed node.
* If Sequoia detects that a failed controller has returned, the recovery log
should be replayed to that node to bring it up to date with the inserts it
missed.
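To make the intent of the two points above concrete, here is a minimal, purely
illustrative sketch of the desired behaviour. All class and method names are
hypothetical and do not refer to the actual Sequoia internals.
----
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch only: names do not match the real Sequoia code base.
class WriteDistributionPolicy {
    private final Set<String> outOfSyncMembers = ConcurrentHashMap.newKeySet();

    // Called when group communication reports a member failure.
    void onMemberFailed(String memberUid) {
        outOfSyncMembers.add(memberUid);      // it will miss writes from now on
    }

    // Called when the member rejoins the group.
    void onMemberReturned(String memberUid) {
        replayRecoveryLog(memberUid);         // second proposal: resync first
        outOfSyncMembers.remove(memberUid);   // only then resume distribution
    }

    // First proposal: never distribute writes to a node that missed earlier writes.
    boolean shouldDistributeTo(String memberUid) {
        return !outOfSyncMembers.contains(memberUid);
    }

    private void replayRecoveryLog(String memberUid) {
        // Placeholder: ship the recovery-log entries recorded while the member
        // was away, so its backend catches up before it receives new writes.
    }
}
----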
This is a very serious bug that I can reproduce repeatedly across my test
cases.