Inserts are distributed to already failed nodes in the group
------------------------------------------------------------

         Key: SEQUOIA-980
         URL: https://forge.continuent.org/jira/browse/SEQUOIA-980
     Project: Sequoia
        Type: Bug

    Versions: Sequoia 2.10.10, Sequoia 2.10.9, Sequoia 2.10.8, Sequoia 2.10.7, 
Sequoia 2.10.6, Sequoia 2.10.5
 Environment: Sequoia 2.10.5 and 2.10.9 using Appia as group communication 
protocol
    Reporter: Stefan Lischke
    Priority: Blocker


Setup: two synced controllers with one backend each.
A loop inserts a row into the VDB every second through controller 2.
While both backends were filling up, I manually cut the communication between 
the two controllers (tested by removing the network cable, stopping the 
network service, firewalling, and routing changes).
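
For reference, a minimal sketch of such an insert loop. It assumes the Sequoia 
JDBC driver class org.continuent.sequoia.driver.Driver, the default controller 
port 25322, and a hypothetical table test_data; the actual test used its own 
schema. Controller 2 (192.168.0.114) is listed first, controller 1 as failover:
----
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Timestamp;

public class InsertLoop {
    public static void main(String[] args) throws Exception {
        // Load the Sequoia driver and connect through controller 2;
        // controller 1 is only the failover candidate.
        Class.forName("org.continuent.sequoia.driver.Driver");
        Connection conn = DriverManager.getConnection(
                "jdbc:sequoia://192.168.0.114:25322,192.168.0.102:25322/myVDB",
                "user", "password");

        PreparedStatement stmt = conn.prepareStatement(
                "INSERT INTO test_data (id, created_at) VALUES (?, ?)");

        for (int i = 0; ; i++) {
            stmt.setInt(1, i);
            stmt.setTimestamp(2, new Timestamp(System.currentTimeMillis()));
            stmt.executeUpdate();   // distributed write through the virtual database
            Thread.sleep(1000);     // one insert per second, as in the test
        }
    }
}
----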

Both controllers notice that the other has failed in the group and, according 
to the log, each only sees itself. Controller 2 keeps accepting writes and 
filling up its VDB.
Everything is fine up to this point; the problem is described below.

controller 1
----
2007-09-19 17:47:53,039 INFO  continuent.hedera.gms 
Member(address=/192.168.0.114:21081, uid=192.168.0.114:21081) failed in 
Group(gid=messenger)
2007-09-19 17:47:53,041 WARN  controller.virtualdatabase.messenger Controller 
Member(address=/192.168.0.114:21081, uid=192.168.0.114:21081) has left the 
cluster.
2007-09-19 17:47:53,042 INFO  controller.virtualdatabase.messenger 0 requests 
were waiting responses from Member(address=/192.168.0.114:21081, 
uid=192.168.0.114:21081)
2007-09-19 17:47:53,045 INFO  controller.requestmanager.cleanup Waiting 60000ms 
for client of controller 562949953421312 to failover
2007-09-19 17:48:09,553 INFO  controller.virtualdatabase.messenger 
Member(address=/192.168.0.102:21081, uid=192.168.0.102:21081) see 
members:[Member(address=/192.168.0.102:21081, uid=192.168.0.102:21081)] and has 
mapping:{Member(address=/192.168.0.102:21081, 
uid=192.168.0.102:21081)=192.168.0.102:21090}
2007-09-19 17:48:53,049 INFO  controller.requestmanager.cleanup Cleanup for 
controller 562949953421312 failure is completed.
----

controller 2 
----
2007-09-19 17:47:50,258 INFO  continuent.hedera.gms 
Member(address=/192.168.0.102:21081, uid=192.168.0.102:21081) failed in 
Group(gid=messenger)
2007-09-19 17:47:50,262 WARN  controller.virtualdatabase.messenger Controller 
Member(address=/192.168.0.102:21081, uid=192.168.0.102:21081) has left the 
cluster.
2007-09-19 17:47:50,263 INFO  controller.virtualdatabase.messenger 1 requests 
were waiting responses from Member(address=/192.168.0.102:21081, 
uid=192.168.0.102:21081)
2007-09-19 17:47:50,263 WARN  controller.RequestManager.messenger 1 
controller(s) died during execution of request 562949953421464
2007-09-19 17:47:50,263 WARN  controller.RequestManager.messenger Controller 
Member(address=/192.168.0.102:21081, uid=192.168.0.102:21081) is suspected of 
failure.
2007-09-19 17:47:50,270 INFO  controller.requestmanager.cleanup Waiting 60000ms 
for client of controller 0 to failover
2007-09-19 17:48:50,275 INFO  controller.requestmanager.cleanup Cleanup for 
controller 0 failure is completed.
----

The problem is that once the connection between controller 1 and controller 2 
is available again (tested by plugging the network cable back in, restarting 
the network service, and reverting the firewall/routing changes), controller 2 
distributes all new inserts (issued after the two controllers see each other 
again) to controller 1 again, without controller 1 ever being resynchronized.

controller 1 sees controller 2:
2007-09-19 17:50:09,578 INFO  controller.virtualdatabase.messenger 
Member(address=/192.168.0.102:21081, uid=192.168.0.102:21081) see 
members:[Member(address=/192.168.0.102:21081, uid=192.168.0.102:21081), 
Member(address=/192.168.0.114:21081, uid=192.168.0.114:21081)] and has 
mapping:{Member(address=/192.168.0.102:21081, 
uid=192.168.0.102:21081)=192.168.0.102:21090}

controller 2 sees controller 1:
2007-09-19 17:49:00,719 INFO  controller.virtualdatabase.messenger 
Member(address=/192.168.0.114:21081, uid=192.168.0.114:21081) see 
members:[Member(address=/192.168.0.102:21081, uid=192.168.0.102:21081), 
Member(address=/192.168.0.114:21081, uid=192.168.0.114:21081)] and has 
mapping:{Member(address=/192.168.0.114:21081, 
uid=192.168.0.114:21081)=192.168.0.114:21090}

This is really ugly because the database behind controller 1 now has fewer 
entries than the one behind controller 2: all inserts made while controller 1 
was disconnected were written only to the backend of controller 2.
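
The divergence can be demonstrated by bypassing Sequoia and counting rows 
directly on each backend database. The sketch below assumes PostgreSQL backends 
running on the same hosts as the controllers and the hypothetical test_data 
table from above; hosts, port, credentials and schema are placeholders:
----
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CompareBackends {
    // Count rows of the test table on one backend, connecting to it directly.
    static long countRows(String url) throws Exception {
        try (Connection c = DriverManager.getConnection(url, "user", "password");
             Statement s = c.createStatement();
             ResultSet rs = s.executeQuery("SELECT COUNT(*) FROM test_data")) {
            rs.next();
            return rs.getLong(1);
        }
    }

    public static void main(String[] args) throws Exception {
        long b1 = countRows("jdbc:postgresql://192.168.0.102:5432/mydb"); // backend of controller 1
        long b2 = countRows("jdbc:postgresql://192.168.0.114:5432/mydb"); // backend of controller 2
        System.out.println("backend 1: " + b1 + " rows, backend 2: " + b2 + " rows");
        if (b1 != b2) {
            System.out.println("backends diverged by " + Math.abs(b1 - b2) + " rows");
        }
    }
}
----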

Two possible solutions (see the sketch below):
* Once Sequoia recognises a failed controller, no subsequent inserts should be 
distributed to that node.
* When Sequoia recognises that a failed controller has returned, the recovery 
log should be replayed to that node so it catches up on the inserts it missed.
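
To make the intent of these two points concrete, here is a purely conceptual 
sketch. It is not Sequoia's actual request manager API and all names are 
hypothetical; the idea is simply that a member flagged as failed is excluded 
from write distribution until it has replayed the recovery log:
----
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class WriteDistributionGuard {
    private final Set<String> failedMembers = ConcurrentHashMap.newKeySet();

    // Called when group communication reports a member failure.
    public void onMemberFailed(String memberId) {
        failedMembers.add(memberId);
    }

    // Writes must not be multicast to a member that is still flagged as failed,
    // even if it reappears in the group membership view.
    public boolean mayReceiveWrites(String memberId) {
        return !failedMembers.contains(memberId);
    }

    // Called only after the returning member has replayed the missed portion of
    // the recovery log; from then on it is eligible for new writes again.
    public void onRecoveryLogReplayed(String memberId) {
        failedMembers.remove(memberId);
    }
}
----
In this scheme, merely reappearing in the group membership view is not enough 
to receive writes again; the recovery log replay is the gate.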

This is a very serious bug that I can reproduce repeatedly in several of my 
test cases.
