Nuno Carvalho wrote:
> On Mar 26, 2007, at 5:00, Sylvain Coutant wrote:
>> Ingo Kampe wrote:
>>> Hi,
>>>
>>> Schnabl, Sebastian wrote:
>>>>> Detail : version is 2.10.6.
>>>> Hm, I remember a similar issue from a short time ago - but that was
>>>> with 3.0beta. Look here:
>>>> https://forge.continuent.org/pipermail/sequoia/2007-February/004791.html
>>>>
>>>> There was a problem of losing the connection between controllers during
>>>> a dump operation (heavy load if controller == db-server). But no
>>>> solution so far.
>>>>
>>>> Possibly a problem with appia and high CPU utilization?
>>>
>>> We had problems with sequoia under high load too. It's not as robust as
>>> I would like it to be. We are using sequoia 2.10.6 with appia built from
>>> source for the new base view configuration.
>>>
>>> Maybe there are some timing problems in the appia.xml SEQ channel
>>> definitions. I could imagine that some timeout windows are not big
>>> enough if the whole system is slow and the "cluster pings" take too
>>> long.
>>
>> Possibly. Our sequoia test controllers are slow (the DB backends are not
>> on the same servers). But the controllers never resync and we have to
>> take down both controllers and restart everything to get them back
>> online. A timing issue would declare one controller dead at some point,
>> but I think some resync mechanism should eventually take over so that
>> they work together again.
>>
>>
> 
> Yes, you are right. You can increase the timers in the suspect
> protocol. Check this:
> http://appia.di.fc.ul.pt/docs/javadoc/org/continuent/appia/protocols/group/suspect/SuspectSession.html#init(org.continuent.appia.xml.utils.SessionProperties)
> 
> But in the case of a real failure (not just a suspicion caused by high
> load) the resync will be needed anyway.

Maybe you can give an example of how to set the suspect_sweep and
suspect_time parameters in the xml? I can't find it anywhere.

Is this a hedera channel parameter, like this:
<chsession name="hederalayer">
  <parameter name="suspect_sweep">2000</parameter>
  <parameter name="suspect_time">6000</parameter>
  ...
</chsession>
??
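
If I read SuspectSession.init right, both values are in milliseconds, so the
above would sweep every 2 seconds and suspect a member after 6 seconds of
silence. Or does it belong on the session definition instead, something like
this (pure guess on my part; the session name "suspect" is assumed)?

<session name="suspect">
  <parameter name="suspect_sweep">2000</parameter>
  <parameter name="suspect_time">6000</parameter>
</session>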

I have done some further testing with the base view default setup on debian
sarge, jvm 1.5. If I push the load (just CPU load!) to values around 6, the
group communication breaks, even if I set the nice level of the sequoia JVM
to -20! I'm really surprised that CPU load can break the communication of a
process running at the highest available priority.
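
For reference, here is roughly how I set the priority (bin/controller.sh is
the start script of my installation, path assumed; adjust the pgrep pattern
to match your controller's java process):

# start the controller at the highest scheduling priority (needs root)
nice -n -20 bin/controller.sh &

# or renice an already running controller JVM
renice -20 -p $(pgrep -f sequoia)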

On node1 I generate the load with:
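# six background jobs, each compressing random data to keep one CPU busy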
for cpu in 1 2 3 4 5 6; do ( cat /dev/urandom | bzip2 >/dev/null ) & done

On node2 I get this in full_cluster.log after at most 3 minutes:
2007-03-27 22:47:31,930 INFO  continuent.hedera.gms
Member(address=/192.168.0.150:21080, uid=192.168.0.150:21080) failed in
Group(gid=botdb)
2007-03-27 22:47:31,935 WARN  controller.virtualdatabase.botdb
Controller Member(address=/192.168.0.150:21080, uid=192.168.0.150:21080)
has left the cluster.
2007-03-27 22:47:31,939 INFO  controller.virtualdatabase.botdb 0
requests were waiting responses from
Member(address=/192.168.0.150:21080, uid=192.168.0.150:21080)
2007-03-27 22:47:31,985 INFO  controller.requestmanager.cleanup Waiting
60000ms for client of controller 1688849860263936 to failover

Does anybody have an idea what to fix here? BTW, the same test with JGroups
passes without problems. JGroups communication is stable (tested up to a
load of 20).

Regards,
Ingo


--
https://www.globaltrustpoint.com/
[EMAIL PROTECTED]
