[ https://forge.continuent.org/jira/browse/SEQUOIA-980?page=comments#action_14357 ]
Emmanuel Cecchet commented on SEQUOIA-980:
------------------------------------------

As stated in the documentation, network partitions are not supported. There are
some mechanisms in Sequoia that should detect the partition a posteriori, but I
don't know why this does not work in your use case. Maybe we are missing
something in the Appia adapter.

> Inserts are distributed to already failed nodes in the group
> -------------------------------------------------------------
>
>          Key: SEQUOIA-980
>          URL: https://forge.continuent.org/jira/browse/SEQUOIA-980
>      Project: Sequoia
>         Type: Bug
>     Versions: Sequoia 2.10.10, Sequoia 2.10.9, Sequoia 2.10.8, Sequoia 2.10.7,
>               Sequoia 2.10.6, Sequoia 2.10.5
> Environment: Sequoia 2.10.5 and 2.10.9 using Appia as group communication
>              protocol
>    Reporter: Stefan Lischke
>    Priority: Blocker
>
> Setup: 2 synced controllers with 1 backend each.
> I started a loop that inserts a row into the VDB every second through
> controller 2 (a minimal sketch of such a loop follows the log excerpts
> below).
> While both backends were filling up, I manually stopped the communication
> between the two controllers (tested with: removing the network cable /
> networking stop / firewalling / routing).
> Both controllers notice that the other has failed in the group and,
> according to the log, each one only sees itself. Controller 2 keeps
> accepting writes and fills up its VDB.
> Everything is OK up to this point; the problem starts below.
>
> controller 1
> ----
> 2007-09-19 17:47:53,039 INFO continuent.hedera.gms Member(address=/192.168.0.114:21081, uid=192.168.0.114:21081) failed in Group(gid=messenger)
> 2007-09-19 17:47:53,041 WARN controller.virtualdatabase.messenger Controller Member(address=/192.168.0.114:21081, uid=192.168.0.114:21081) has left the cluster.
> 2007-09-19 17:47:53,042 INFO controller.virtualdatabase.messenger 0 requests were waiting responses from Member(address=/192.168.0.114:21081, uid=192.168.0.114:21081)
> 2007-09-19 17:47:53,045 INFO controller.requestmanager.cleanup Waiting 60000ms for client of controller 562949953421312 to failover
> 2007-09-19 17:48:09,553 INFO controller.virtualdatabase.messenger Member(address=/192.168.0.102:21081, uid=192.168.0.102:21081) see members:[Member(address=/192.168.0.102:21081, uid=192.168.0.102:21081)] and has mapping:{Member(address=/192.168.0.102:21081, uid=192.168.0.102:21081)=192.168.0.102:21090}
> 2007-09-19 17:48:53,049 INFO controller.requestmanager.cleanup Cleanup for controller 562949953421312 failure is completed.
> ----
>
> controller 2
> ----
> 2007-09-19 17:47:50,258 INFO continuent.hedera.gms Member(address=/192.168.0.102:21081, uid=192.168.0.102:21081) failed in Group(gid=messenger)
> 2007-09-19 17:47:50,262 WARN controller.virtualdatabase.messenger Controller Member(address=/192.168.0.102:21081, uid=192.168.0.102:21081) has left the cluster.
> 2007-09-19 17:47:50,263 INFO controller.virtualdatabase.messenger 1 requests were waiting responses from Member(address=/192.168.0.102:21081, uid=192.168.0.102:21081)
> 2007-09-19 17:47:50,263 WARN controller.RequestManager.messenger 1 controller(s) died during execution of request 562949953421464
> 2007-09-19 17:47:50,263 WARN controller.RequestManager.messenger Controller Member(address=/192.168.0.102:21081, uid=192.168.0.102:21081) is suspected of failure.
> 2007-09-19 17:47:50,270 INFO controller.requestmanager.cleanup Waiting 60000ms for client of controller 0 to failover
> 2007-09-19 17:48:50,275 INFO controller.requestmanager.cleanup Cleanup for controller 0 failure is completed.
> ----
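>
> A minimal sketch of such an insert loop, for reference (the driver class
> name, the JDBC port 25322, the table name and the credentials are
> assumptions for illustration, not taken from my actual test code):
>
>     import java.sql.Connection;
>     import java.sql.DriverManager;
>     import java.sql.PreparedStatement;
>
>     public class InsertLoop {
>         public static void main(String[] args) throws Exception {
>             // Sequoia JDBC driver (class name assumed).
>             Class.forName("org.continuent.sequoia.driver.Driver");
>             // Connect through controller 2 only (192.168.0.114), as in the
>             // test; 25322 is assumed to be the controller's JDBC port.
>             Connection con = DriverManager.getConnection(
>                     "jdbc:sequoia://192.168.0.114:25322/myVDB", "user", "secret");
>             PreparedStatement ps = con.prepareStatement(
>                     "INSERT INTO stuff (created_at) VALUES (?)");
>             while (true) {
>                 ps.setLong(1, System.currentTimeMillis());
>                 ps.executeUpdate();   // one insert per second into the VDB
>                 Thread.sleep(1000);
>             }
>         }
>     }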
> The problem: when the connection between controller 1 and controller 2 is
> available again (tested with: plugging the network cable back in / network
> start / firewalling / routing), controller 2 distributes all new inserts
> (made after the controllers see each other again) to controller 1 again.
>
> controller 1 sees controller 2:
> 2007-09-19 17:50:09,578 INFO controller.virtualdatabase.messenger Member(address=/192.168.0.102:21081, uid=192.168.0.102:21081) see members:[Member(address=/192.168.0.102:21081, uid=192.168.0.102:21081), Member(address=/192.168.0.114:21081, uid=192.168.0.114:21081)] and has mapping:{Member(address=/192.168.0.102:21081, uid=192.168.0.102:21081)=192.168.0.102:21090}
>
> controller 2 sees controller 1:
> 2007-09-19 17:49:00,719 INFO controller.virtualdatabase.messenger Member(address=/192.168.0.114:21081, uid=192.168.0.114:21081) see members:[Member(address=/192.168.0.102:21081, uid=192.168.0.102:21081), Member(address=/192.168.0.114:21081, uid=192.168.0.114:21081)] and has mapping:{Member(address=/192.168.0.114:21081, uid=192.168.0.114:21081)=192.168.0.114:21090}
>
> That is really ugly, because the database of controller 1 now has fewer
> entries than controller 2: all inserts made while controller 1 was not
> connected were only written to the backend of controller 2.
>
> Two possible solutions (an illustrative sketch follows at the end of this
> message):
> * If Sequoia recognises a failed controller, no subsequent inserts should
>   be distributed to the once-failed node.
> * If Sequoia recognises that a failed controller has returned, the recovery
>   log should be replayed to the failed node to bring it up to date on the
>   inserts it missed.
>
> This is a very serious bug that can be seen repeatedly in more of my test
> cases.
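>
> For illustration only, a sketch of how the two points above could fit
> together (this is not Sequoia's actual API; the class, interface and method
> names below are made up to describe the intended behaviour):
>
>     import java.util.Set;
>     import java.util.concurrent.ConcurrentHashMap;
>
>     public class RejoinPolicy {
>         // Controllers that failed at some point and have not been resynced.
>         private final Set<String> staleControllers = ConcurrentHashMap.newKeySet();
>
>         // Called when the group membership reports a failed member.
>         public void onControllerFailed(String controllerId) {
>             staleControllers.add(controllerId);
>         }
>
>         // Point 1: only controllers that are not stale receive new writes.
>         public boolean mayReceiveWrites(String controllerId) {
>             return !staleControllers.contains(controllerId);
>         }
>
>         // Point 2: when a failed member shows up in the view again, replay
>         // the writes it missed before including it in distribution again.
>         public void onControllerRejoined(String controllerId, RecoveryLog log) {
>             if (staleControllers.contains(controllerId)) {
>                 log.replayMissedWritesTo(controllerId);
>                 staleControllers.remove(controllerId);
>             }
>         }
>
>         // Placeholder for whatever component holds the missed writes.
>         public interface RecoveryLog {
>             void replayMissedWritesTo(String controllerId);
>         }
>     }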
