[ https://forge.continuent.org/jira/browse/SEQUOIA-980?page=comments#action_14090 ]

Stefan Lischke commented on SEQUOIA-980:
----------------------------------------

I debugged this scenario and came to the conclusion that this is a bug in the 
communication design.
If the network link between the two controllers goes down, both of them receive
* AbstractGroupMembershipService.failedMember(..)
which removes the failed member from the group. The listener method
* DistributedVirtualDatabase.failedMember()
sets the checkpoint and calls quitMember()

When the network link comes back up, every controller receives 
* AbstractGroupMembershipService.joinMember()
which adds the new member back to the group and calls the listener interface 
* DistributedVirtualDatabase.joinMember()
which only sets a checkpoint.

And that is the bug: in DistributedVirtualDatabase.joinMember() we have to check 
for consistency and replay the recovery log if needed, just like it is already done in 
DistributedVirtualDatabase.joinGroup()
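
To make this concrete, here is a tiny standalone sketch of the behaviour I mean 
(plain Java, not the real Sequoia classes; the names and the checkpoint-as-log-index 
model are only illustrations): when a previously failed controller joins again, its 
last checkpoint is compared against our recovery log and every write it missed is 
replayed before it takes part in new writes.

import java.util.ArrayList;
import java.util.List;

class RejoinReplaySketch {
    /** one logged write, tagged with the recovery-log index it was recorded at */
    record LoggedWrite(long logIndex, String sql) {}

    private final List<LoggedWrite> recoveryLog = new ArrayList<>();
    private long nextIndex = 0;

    /** every distributed write is appended to the recovery log */
    void logWrite(String sql) {
        recoveryLog.add(new LoggedWrite(nextIndex++, sql));
    }

    /**
     * Called when a previously failed controller shows up again.
     * memberCheckpoint is the log index the member had reached when it left
     * the group; everything after it has to be replayed to that member
     * before it receives new writes again.
     */
    List<LoggedWrite> onJoinMember(long memberCheckpoint) {
        List<LoggedWrite> missed = new ArrayList<>();
        for (LoggedWrite w : recoveryLog) {
            if (w.logIndex() >= memberCheckpoint) {
                missed.add(w);
            }
        }
        return missed;
    }

    public static void main(String[] args) {
        RejoinReplaySketch vdb = new RejoinReplaySketch();
        vdb.logWrite("INSERT INTO t VALUES (1)");  // both controllers got this one
        long checkpointAtFailure = 1;               // the other controller drops out here
        vdb.logWrite("INSERT INTO t VALUES (2)");  // only the surviving controller got this
        // on rejoin, the returning controller must replay what it missed:
        System.out.println(vdb.onJoinMember(checkpointAtFailure));
    }
}

joinGroup() already does this kind of comparison and replay when a controller starts 
up, so the fix would basically route joinMember() through that same path instead of 
only setting a checkpoint.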

I'll hack around this.... any objections?

> Inserts are distributed to already failed nodes in the group
> ------------------------------------------------------------
>
>          Key: SEQUOIA-980
>          URL: https://forge.continuent.org/jira/browse/SEQUOIA-980
>      Project: Sequoia
>         Type: Bug
>     Versions: Sequoia 2.10.10, Sequoia 2.10.9, Sequoia 2.10.8, Sequoia 
> 2.10.7, Sequoia 2.10.6, Sequoia 2.10.5
>  Environment: Sequoia 2.10.5 and 2.10.9 using Appia as group communication 
> protocol
>     Reporter: Stefan Lischke
>     Priority: Blocker
>
>
> Setup: 2 synced controllers with 1 backend each. 
> I start a loop that inserts data into the VDB every second through controller 2. 
> While both backends are filling up, I manually stop communication between the two 
> controllers (tested by: removing the network cable / stopping networking / 
> firewalling / routing). 
> Both controllers notice that the other one failed in the group and, according to 
> the log, each only sees itself. Controller 2 keeps accepting writes and fills up 
> its VDB.
> Everything is fine up to this point. The problem is below.
> controller 1
> ----
> 2007-09-19 17:47:53,039 INFO  continuent.hedera.gms 
> Member(address=/192.168.0.114:21081, uid=192.168.0.114:21081) failed in 
> Group(gid=messenger)
> 2007-09-19 17:47:53,041 WARN  controller.virtualdatabase.messenger Controller 
> Member(address=/192.168.0.114:21081, uid=192.168.0.114:21081) has left the 
> cluster.
> 2007-09-19 17:47:53,042 INFO  controller.virtualdatabase.messenger 0 requests 
> were waiting responses from Member(address=/192.168.0.114:21081, 
> uid=192.168.0.114:21081)
> 2007-09-19 17:47:53,045 INFO  controller.requestmanager.cleanup Waiting 
> 60000ms for client of controller 562949953421312 to failover
> 2007-09-19 17:48:09,553 INFO  controller.virtualdatabase.messenger 
> Member(address=/192.168.0.102:21081, uid=192.168.0.102:21081) see 
> members:[Member(address=/192.168.0.102:21081, uid=192.168.0.102:21081)] and 
> has mapping:{Member(address=/192.168.0.102:21081, 
> uid=192.168.0.102:21081)=192.168.0.102:21090}
> 2007-09-19 17:48:53,049 INFO  controller.requestmanager.cleanup Cleanup for 
> controller 562949953421312 failure is completed.
> ----
> controller 2 
> ----
> 2007-09-19 17:47:50,258 INFO  continuent.hedera.gms 
> Member(address=/192.168.0.102:21081, uid=192.168.0.102:21081) failed in 
> Group(gid=messenger)
> 2007-09-19 17:47:50,262 WARN  controller.virtualdatabase.messenger Controller 
> Member(address=/192.168.0.102:21081, uid=192.168.0.102:21081) has left the 
> cluster.
> 2007-09-19 17:47:50,263 INFO  controller.virtualdatabase.messenger 1 requests 
> were waiting responses from Member(address=/192.168.0.102:21081, 
> uid=192.168.0.102:21081)
> 2007-09-19 17:47:50,263 WARN  controller.RequestManager.messenger 1 
> controller(s) died during execution of request 562949953421464
> 2007-09-19 17:47:50,263 WARN  controller.RequestManager.messenger Controller 
> Member(address=/192.168.0.102:21081, uid=192.168.0.102:21081) is suspected of 
> failure.
> 2007-09-19 17:47:50,270 INFO  controller.requestmanager.cleanup Waiting 
> 60000ms for client of controller 0 to failover
> 2007-09-19 17:48:50,275 INFO  controller.requestmanager.cleanup Cleanup for 
> controller 0 failure is completed.
> ----
> The problem is: when the connection between controller 1 and controller 2 becomes 
> available again (tested by: plugging the network cable back in / starting 
> networking / firewalling / routing), controller 2 distributes all new inserts 
> (issued after they see each other again) to controller 1 again.
> controller 1 sees controller 2:
> 2007-09-19 17:50:09,578 INFO  controller.virtualdatabase.messenger 
> Member(address=/192.168.0.102:21081, uid=192.168.0.102:21081) see 
> members:[Member(address=/192.168.0.102:21081, uid=192.168.0.102:21081), 
> Member(address=/192.168.0.114:21081, uid=192.168.0.114:21081)] and has 
> mapping:{Member(address=/192.168.0.102:21081, 
> uid=192.168.0.102:21081)=192.168.0.102:21090}
> controller 2 sees controller 1:
> 2007-09-19 17:49:00,719 INFO  controller.virtualdatabase.messenger 
> Member(address=/192.168.0.114:21081, uid=192.168.0.114:21081) see 
> members:[Member(address=/192.168.0.102:21081, uid=192.168.0.102:21081), 
> Member(address=/192.168.0.114:21081, uid=192.168.0.114:21081)] and has 
> mapping:{Member(address=/192.168.0.114:21081, 
> uid=192.168.0.114:21081)=192.168.0.114:21090}
> That's really ugly, because the database of controller 1 now has fewer entries 
> than controller 2: all inserts made while controller 1 was not connected were 
> only written to the backend of controller 2. 
> Two possible solutions:
> * If Sequoia recognises a failed controller, no subsequent inserts should be 
> distributed to the once-failed node.
> * If Sequoia recognises that a failed controller has returned, the recovery log 
> should be replayed to that node to bring it up to date on the inserts it missed.
> This is a very serious bug that shows up repeatedly in more of my test cases.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   https://forge.continuent.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

_______________________________________________
Sequoia mailing list
[email protected]
https://forge.continuent.org/mailman/listinfo/sequoia
