Hi Francis,

*sarcasm* This looks like normal Sequoia behavior ;-) *sarcasm*

But have you looked at the load on the controller machines? If the load
is too high and the Appia timeout is too short, you get the "failed in
group" behavior from your Problem 1.

Problem 2 is the really bad one. Let me try to explain.
If Controller 2 says Controller 1 failed, Controller 2 still accepts
write requests, and those write requests are not distributed to the
failed controller. So the DBs are not in sync anymore. When the load
drops again, Controller 1 automatically rejoins (it does not tell you
about this) and the following writes are distributed again. Then you
have a lot of problems, like yours.
I think this problem is by design, because Sequoia has no notion of a
split-brain scenario. If a controller was lost and even a single write
was not distributed to it, the failed controller should NEVER rejoin the
cluster. The controllers should then keep working as in a split-brain
scenario.
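
To make the scenario concrete, here is a minimal Python sketch. It is not
Sequoia code: the Replica and Cluster classes, the missed counter and the
strict flag are all made up for illustration. It models a group that skips
a failed member when distributing writes, and compares a silent rejoin with
a rejoin that is refused once a write has been missed.

```python
class Replica:
    """Hypothetical stand-in for one controller plus its database backend."""

    def __init__(self, name):
        self.name = name
        self.rows = set()      # primary keys stored on this backend
        self.in_group = True   # False while the group considers it failed


class Cluster:
    """Hypothetical model of write distribution; not the Sequoia API."""

    def __init__(self, replicas):
        self.replicas = replicas
        # Number of writes each replica never received.
        self.missed = {r.name: 0 for r in replicas}

    def write(self, key):
        # Distribute the insert to members still in the group; count skips.
        for r in self.replicas:
            if r.in_group:
                r.rows.add(key)
            else:
                self.missed[r.name] += 1

    def fail(self, replica):
        replica.in_group = False

    def rejoin(self, replica, strict):
        # strict=True refuses a member that missed at least one write.
        if strict and self.missed[replica.name] > 0:
            return False
        replica.in_group = True
        return True


def demo(strict):
    c1, c2 = Replica("controller1"), Replica("controller2")
    cluster = Cluster([c1, c2])
    cluster.write(1)                  # reaches both backends
    cluster.fail(c1)                  # controller 2 reports controller 1 failed
    cluster.write(2)                  # reaches controller 2 only
    rejoined = cluster.rejoin(c1, strict)
    cluster.write(3)                  # reaches controller 1 only if it rejoined
    return rejoined, sorted(c1.rows), sorted(c2.rows)


print(demo(strict=False))  # (True, [1, 3], [1, 2, 3])
print(demo(strict=True))   # (False, [1], [1, 2, 3])
```

With strict=False the first backend is missing key 2 but serves traffic as
if nothing had happened. With strict=True it stays out of the group until
someone resynchronizes it, which is the behavior I am arguing for.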
This scenario does not only apply to high loads; it is also a big problem
when there is a very short network split (you should never use an
Internet connection between two nodes; always use a direct cross-linked
connection or a dedicated network with its own switch).
I tried to raise this problem some time ago and I opened some bugs, but
we never discussed this big issue in this community. ;-(
If it is the load on your system, try changing the following values in
appia.xml:
<chsession name="suspectl">
<parameter name="suspect_sweep">10000</parameter>
<parameter name="suspect_time">30000</parameter>
</chsession>
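
If you would rather script the change than edit the file by hand, something
like the following Python sketch could work. It only uses the standard
xml.etree.ElementTree module. The starting values in FRAGMENT are
placeholders, not Sequoia's shipped defaults, and the helper name
set_suspect_params is made up. I am also assuming that a real appia.xml
contains the suspectl session exactly as in the fragment above.

```python
import xml.etree.ElementTree as ET

# Placeholder input: the old values here are invented for the example.
FRAGMENT = """<chsession name="suspectl">
<parameter name="suspect_sweep">3000</parameter>
<parameter name="suspect_time">10000</parameter>
</chsession>"""


def set_suspect_params(session, sweep, time):
    """Overwrite suspect_sweep and suspect_time inside one chsession element."""
    for param in session.findall("parameter"):
        if param.get("name") == "suspect_sweep":
            param.text = str(sweep)
        elif param.get("name") == "suspect_time":
            param.text = str(time)
    return session


session = ET.fromstring(FRAGMENT)
set_suspect_params(session, 10000, 30000)
updated = ET.tostring(session, encoding="unicode")
print(updated)
```

For a whole file you would load it with ET.parse, locate the chsession
element whose name attribute is "suspectl", and write the tree back out.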
hth
stefan
Francis Chong wrote:
> Hi,
>
> We are examining Sequoia as an HA solution. However, we encounter the
> following problems after running some hundreds of thousands of
> transactions:
>
> Setup:
> Two controllers on two servers, each connecting to one Postgres database
> backend. Clients use "ordered" load balancing to connect to controller 1,
> then controller 2.
>
> Problems:
> 1. After tens of thousands of transactions, the Appia group is disconnected.
>
> 2007-12-19 13:59:54,523 INFO continuent.hedera.gms Member(address=/128.128.3.30:43994, uid=128.128.3.30:43994) failed in Group(gid=s2)
>
> 2. After problem 1, controller 1 encountered a write error due to a
> unique key constraint. Controller 1 is then disabled.
>
> 2007-12-19 14:00:32,586 ERROR controller.loadbalancer.RAIDb1 write request 844424930142550 failed:
> Backend s2 - BackendWorkerThread for backend 'middleware1' with RAIDb level:1 failed (ERROR: duplicate key violates unique constraint "pk_sale_order")
>
>
> 3. After problem 2, controller 1 is automatically disabled ("show backend
> *" shows it was disabled). Here we expect all clients to connect to
> controller 2 from now on. However, clients can still connect to
> controller 1 and only get an error when they execute a query:
>
> Error at client:
> org.continuent.sequoia.common.exceptions.driver.DriverSQLException: Message of cause: null, SQL State: null, Error Code: 1
> org.continuent.sequoia.common.exceptions.driver.DriverSQLException: Message of cause: null
>     at org.continuent.sequoia.driver.Connection.statementExecuteQuery(Connection.java:2840)
>     at org.continuent.sequoia.driver.Statement.executeQuery(Statement.java:522)
>     at org.continuent.sequoia.driver.Statement.executeQuery(Statement.java:495)
>     at ...
> Caused by: org.continuent.sequoia.common.exceptions.driver.protocol.ControllerCoreException
> SerializableStackTrace of each cause:
> org.continuent.sequoia.common.exceptions.driver.protocol.ControllerCoreException
>     at org.continuent.sequoia.controller.requestmanager.distributed.RAIDb1DistributedRequestManager.execRemoteStatementExecuteQuery(RAIDb1DistributedRequestManager.java:170)
>     at org.continuent.sequoia.controller.requestmanager.distributed.DistributedRequestManager.statementExecuteQuery(DistributedRequestManager.java:1370)
>     at org.continuent.sequoia.controller.virtualdatabase.VirtualDatabase.statementExecuteQuery(VirtualDatabase.java:549)
>     at org.continuent.sequoia.controller.virtualdatabase.VirtualDatabaseWorkerThread.statementExecuteQuery(VirtualDatabaseWorkerThread.java:2175)
>     at org.continuent.sequoia.controller.virtualdatabase.VirtualDatabaseWorkerThread.run(VirtualDatabaseWorkerThread.java:442)
>
> Error at controller:
> 2007-12-19 14:33:13,167 WARN controller.RequestManager.sunbeam-s2 An error occured while executing remote select request 281474977312374
> org.continuent.sequoia.common.exceptions.NoMoreBackendException
>     at org.continuent.sequoia.controller.requestmanager.distributed.RAIDb1DistributedRequestManager.execRemoteStatementExecuteQuery(RAIDb1DistributedRequestManager.java:170)
>     at org.continuent.sequoia.controller.requestmanager.distributed.DistributedRequestManager.statementExecuteQuery(DistributedRequestManager.java:1370)
>     at org.continuent.sequoia.controller.virtualdatabase.VirtualDatabase.statementExecuteQuery(VirtualDatabase.java:549)
>     at org.continuent.sequoia.controller.virtualdatabase.VirtualDatabaseWorkerThread.statementExecuteQuery(VirtualDatabaseWorkerThread.java:2175)
>     at org.continuent.sequoia.controller.virtualdatabase.VirtualDatabaseWorkerThread.run(VirtualDatabaseWorkerThread.java:442)
> 2007-12-19 14:33:13,169 WARN controller.RequestManager.sunbeam-s2 An error occured while executing remote select request 281474977312375
> org.continuent.sequoia.common.exceptions.NoMoreBackendException
>     at org.continuent.sequoia.controller.requestmanager.distributed.RAIDb1DistributedRequestManager.execRemoteStatementExecuteQuery(RAIDb1DistributedRequestManager.java:170)
>     at org.continuent.sequoia.controller.requestmanager.distributed.DistributedRequestManager.statementExecuteQuery(DistributedRequestManager.java:1370)
>     at org.continuent.sequoia.controller.virtualdatabase.VirtualDatabase.statementExecuteQuery(VirtualDatabase.java:549)
>
> Problem 1 does not seem to be related to the protocol, as we tried
> various Appia protocols. Do you have any suggestions to help us
> investigate the issue?
>
> Problem 3 is critical, as none of our clients can execute queries even
> though controller 2 is alive and running.
>
> Thanks and Regards
> Francis
>
> _______________________________________________
> Sequoia mailing list
> [email protected]
> https://forge.continuent.org/mailman/listinfo/sequoia
