Hi Stefan

Thanks for your suggestions! In fact, I had a similar guess. I ran the same
test with the JVM socket timeout increased from 10 seconds to 30 seconds, and
the problem no longer seems to recur (I will do further load testing to
confirm).

However, my major concern is that if such a case does happen, we expect
controller 1 to be disabled and all traffic to be directed to controller 2.
Instead, controller 1 still accepts connections and only throws errors when
executing a message, which seems strange.

Regards
Francis
-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Stefan
Lischke
Sent: Wednesday, December 19, 2007 6:51 PM
To: Sequoia general mailing list
Subject: Re: [Sequoia] Controller can be connected but failed executing statement

Hi Francis,

*sarcasm* This looks like normal Sequoia behavior ;-) *sarcasm*

But have you looked at the load on the controller machines? If the load is
too high and the Appia timeout is too short, you get this "failed in
group" Problem 1 behavior.

But Problem 2 is really bad; I will try to explain it to you.
If Controller 2 says Controller 1 failed, Controller 2 still accepts write
requests, and those write requests are not distributed to the failed
controller. So the DBs are not in sync anymore. Now when the load gets
lower, Controller 1 automatically rejoins (it does not tell you about
this) and the following writes are distributed again. Then you have a
lot of problems, like yours.

I think this problem is by design, because the design has no split-brain
scenario. If one controller was lost and even one write was not
distributed to it, the failed controller should NEVER rejoin the
cluster. The controllers should then operate as in a split-brain scenario.
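
To make the sequence above concrete, here is a small toy simulation. It is
only a sketch under my own assumptions: the names Controller, distribute,
rejoin and the "strict" flag are hypothetical and are not Sequoia APIs. It
shows how one write missed during a false failure suspicion leaves the two
backends permanently different once the controller silently rejoins, and how
refusing the rejoin at least keeps the failed side from receiving later
writes.

```python
class Controller:
    """Toy controller: its backend is just an ordered log of applied writes."""

    def __init__(self, name):
        self.name = name
        self.log = []            # writes applied to this controller's backend
        self.in_group = True     # False while the peer considers it failed
        self.missed_writes = 0   # writes that were not distributed to it


def distribute(write, controllers):
    """Apply a write on every controller currently in the group."""
    for c in controllers:
        if c.in_group:
            c.log.append(write)
        else:
            c.missed_writes += 1


def rejoin(controller, strict):
    """Rejoin the group; with strict=True, refuse if any write was missed."""
    if strict and controller.missed_writes > 0:
        return False
    controller.in_group = True
    return True


def run(strict):
    c1, c2 = Controller("controller1"), Controller("controller2")
    group = [c1, c2]
    distribute("w1", group)       # both controllers apply w1
    c1.in_group = False           # controller 2 suspects controller 1
    distribute("w2", group)       # only controller 2 applies w2
    rejoined = rejoin(c1, strict) # load drops, controller 1 tries to rejoin
    distribute("w3", group)
    return rejoined, c1.log, c2.log


if __name__ == "__main__":
    print(run(strict=False))  # automatic rejoin: logs differ, w2 is missing
    print(run(strict=True))   # rejoin refused: controller 1 stays out
```

With strict=False, controller 1 ends up with w1 and w3 while controller 2 has
w1, w2 and w3, which is the out-of-sync state described above. With
strict=True, controller 1 stays out of the group until someone resynchronizes
it by hand.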

This scenario does not only apply to high loads; it is also a big problem
when there is a very short network split (you should never use an internet
connection between two nodes; always use a crosslinked network connection
or a dedicated network with its own switch).

I tried to talk about this problem some time ago and I opened some bugs,
but we never discussed this big issue in this community. ;-(

If it is the load on your system, try changing the following values in
appia.xml:

    <chsession name="suspectl">
        <parameter name="suspect_sweep">10000</parameter>
        <parameter name="suspect_time">30000</parameter>

    </chsession>
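
As a rough illustration of how these two values might interact, here is a
small sketch. It rests on an assumption taken only from the parameter names,
not from the Appia documentation: that suspect_sweep is the interval between
failure-detector checks, suspect_time is how long a member may stay silent
before it is suspected, and both are in milliseconds. The function name
first_suspicion_ms is hypothetical.

```python
def first_suspicion_ms(last_heard_ms, suspect_sweep_ms, suspect_time_ms):
    """Return the time of the first sweep at which a silent member is suspected.

    Assumption: sweeps run at multiples of suspect_sweep_ms, and a member is
    suspected once it has been silent for at least suspect_time_ms.
    """
    if suspect_sweep_ms <= 0:
        raise ValueError("suspect_sweep_ms must be positive")
    t = 0
    while True:
        t += suspect_sweep_ms
        if t - last_heard_ms >= suspect_time_ms:
            return t


if __name__ == "__main__":
    # Values from the appia.xml fragment above.
    print(first_suspicion_ms(0, 10000, 30000))
    print(first_suspicion_ms(5000, 10000, 30000))
```

Under that assumption, a controller that stalls for less than suspect_time
under heavy load is not suspected, and the sweep interval adds up to one
extra sweep of delay before a real failure is noticed.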


hth stefan


_______________________________________________
Sequoia mailing list
[email protected]
https://forge.continuent.org/mailman/listinfo/sequoia
