Re: Cluster fixes - Need Coordination of work

Peter Rossbach Sat, 16 Apr 2005 01:26:42 -0700

Hey Filip,

your help is very welcome.

Filip Hanik - Dev Lists wrote:

I ran some load tests with the pooled mode and the clustering stats are looking good. Next week I am expecting to dig a little bit deeper into the code, but so far it is looking pretty good.

Well, that is very good news.

I am getting an increased number of incomplete responses, such as 302 redirects from tomcat, but that can also be the load balancer or the client scrambling the headers making an incomplete request.

I have tested with mod_jk 1.2.10 load balancing and Apache 2.0.52/53 (Windows XP, SuSE 9.1). Next week I will start some tests with a Cisco LB in combination with a lot of Apaches/Tomcats (8 Apaches, and on every host a Tomcat cluster domain of 3 nodes).

I don't see those 302s.

I am glad you removed the compress flag. I am not sure what that was to begin with, as if I remember correctly, messages were already being compressed, and during profiling this had little impact on performance.

In my profiling, compress mode is only useful when you have large replication messages (> 8k bytes), but it uses more CPU (> 20-30% more). I did not remove the compress flag; I have disabled it by default. It is a sender/receiver attribute. The waitForAck and compress attributes were transferred to the Receiver:

<Receiver className="org.apache.catalina.cluster.tcp.SocketReplicationListener"
          tcpListenAddress="@node.clustertcp.address@"
          tcpListenPort="@node.clustertcp.port@"
          doReceivedProcessingStats="true" />
<Sender className="org.apache.catalina.cluster.tcp.ReplicationTransmitter"
        replicationMode="fastasyncqueue"
        compress="true"
        doTransmitterProcessingStats="true"
        doProcessingStats="true"
        doWaitAckStats="true"
        queueTimeWait="true"
        queueDoStats="true"
        queueCheckLock="true"
        ackTimeout="15000"
        waitForAck="true"
        autoConnect="false"
        keepAliveTimeout="80000"
        keepAliveMaxRequestCount="-1"/>

One of my ideas is:

Change the cluster protocol so that developers can add their own data serialization/deserialization format (high risk)

Currently:
  header 6 bytes (FLT2002), data.length 4 bytes, data, end header 6 bytes (TLF2003)

Optimized to:
  header 2 bytes (TC), type 1 byte, compress flag 1 byte, data.length 4 bytes, data
  or, when compressed, the data part is: <real uncompressed data.length (4 bytes)> data

- "type" means a user-defined type; the receiver extracts the bytes and the type and sends them to the callback (see ObjectReader or SocketObjectReader).
- compress = 1: the first 4 data bytes are the real uncompressed data length. (This is for better memory management at the receiver side, see XByteBuffer.)
- Override the ClusterSender and ClusterReceiver serialize/deserialize methods.
- Then we can set a flag on ClusterMessage or make an on-the-fly decision to compress data.
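
To make the optimized layout concrete, here is a minimal sketch of how such a frame could be written and read. It is only an illustration: ClusterFrameCodec and Frame are hypothetical names, not classes in the Tomcat cluster module, GZIP is an assumed compression algorithm, and the 8k threshold for the on-the-fly decision is simply taken from the profiling remark above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical codec for the proposed frame layout; not Tomcat code. */
final class ClusterFrameCodec {
    static final byte[] MAGIC = {'T', 'C'};          // 2-byte header
    static final int COMPRESS_THRESHOLD = 8 * 1024;  // assumption: compress only above 8k

    /** Decoded frame: user-defined type plus the uncompressed payload. */
    static final class Frame {
        final byte type;
        final byte[] payload;
        Frame(byte type, byte[] payload) { this.type = type; this.payload = payload; }
    }

    /** Layout: TC | type (1) | compress flag (1) | data.length (4) | data. */
    static byte[] encode(byte type, byte[] payload) throws IOException {
        boolean compress = payload.length > COMPRESS_THRESHOLD; // on-the-fly decision
        byte[] data = payload;
        if (compress) {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(payload);
            }
            byte[] zipped = bos.toByteArray();
            // Prefix the real uncompressed length so the receiver can size its buffer.
            data = ByteBuffer.allocate(4 + zipped.length)
                    .putInt(payload.length).put(zipped).array();
        }
        return ByteBuffer.allocate(8 + data.length)
                .put(MAGIC).put(type).put((byte) (compress ? 1 : 0))
                .putInt(data.length).put(data).array();
    }

    static Frame decode(byte[] frame) throws IOException {
        ByteBuffer in = ByteBuffer.wrap(frame);
        if (in.get() != MAGIC[0] || in.get() != MAGIC[1]) {
            throw new IOException("bad header");
        }
        byte type = in.get();
        boolean compressed = in.get() == 1;
        byte[] data = new byte[in.getInt()];
        in.get(data);
        if (!compressed) {
            return new Frame(type, data);
        }
        // First 4 data bytes carry the uncompressed length; allocate exactly once.
        byte[] out = new byte[ByteBuffer.wrap(data).getInt()];
        try (GZIPInputStream gz = new GZIPInputStream(
                new ByteArrayInputStream(data, 4, data.length - 4))) {
            int off = 0;
            while (off < out.length) {
                int n = gz.read(out, off, out.length - off);
                if (n < 0) throw new IOException("truncated payload");
                off += n;
            }
        }
        return new Frame(type, out);
    }
}
```

The receiver would hand Frame.type and Frame.payload to whatever callback is registered for that type; how that dispatch looks is left open here.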

When changing the code, I was wondering if we can stick to method names that make sense and are logical.

public int getTimeoutAllSession()
If this means return the count of all sessions that have timed out, I would suggest
public int getSessionTimeoutCount()

No, it is the timeout value, in seconds, that DeltaManager waits after sending the "all sessions" event to one other cluster member.

protected ClusterMessage createRecevierObject(byte[] data)
Do you mean deserialize? As in
protected ClusterMessage deserialize(byte[] data)

Yes, I have changed the names in ClusterReceiverBase and ReplicationTransmitter. Those are also my favorite names, but time is limited when you refactor code.

I must admit that I am having a little bit of a hard time reading the code because of the funky naming conventions, do you mind me cleaning up some when I go in and add changes?

Yes, feel free to find better names. Please also change the names inside the MBean descriptors and the test code. I think we must coordinate the work: you announce the renaming step, then I can pause my redesign and refactorings.

I will be pushing for stabilization as opposed to new features and so called "refactoring". As an example, to customers stability and speed is more important than features, take MySQL for example.

Yes, you are right. But my code changes are important for better understanding and give clearer semantics to a lot of classes. The other thing is: I want to make the cluster faster and easier to extend. I hope we can also port Remy's/Mladen's APR sockets to the clustering module.

The following cases/classes need help:

- SimpleTcpCluster pause/resume senders
  Do you also mean that pausing the Receiver would help? Then you must also stop the Membership, and that is dangerous.
  => pause: We can send a message to all other nodes saying that we are still a member, but please queue all messages for us. We also queue all messages from the local node.
  => resume: Tell all nodes that they can now send their queues, and we also start sending.

  Hmm: the async senders can handle it. Currently the async socket sender stops its thread when you call disconnect; when you call connect, a new thread starts and all queued messages are sent.

- PooledSocketSender
  - extend JMX stats
  - pause/resume sockets

- DeltaManager
  - expire sessions
        processExpire sends one message for every session.
        Every 60 seconds the cluster sends a lot of those messages.
      => Better: collect all expired sessions and send one big expire message package.
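
A minimal sketch of that batching idea, under the assumption that the manager can see each session's last access time. ExpireBatcher and ExpireBatchMessage are hypothetical names for illustration and do not exist in DeltaManager.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: gathers all expired session ids into one message. */
final class ExpireBatcher {

    /** Hypothetical message carrying many expired session ids at once. */
    static final class ExpireBatchMessage {
        final List<String> sessionIds;
        ExpireBatchMessage(List<String> sessionIds) { this.sessionIds = sessionIds; }
    }

    /**
     * lastAccessed maps session id to last access time in milliseconds.
     * Returns null when nothing has expired, so no message needs to be sent.
     */
    static ExpireBatchMessage collectExpired(Map<String, Long> lastAccessed,
                                             long now, long maxInactiveMillis) {
        List<String> expired = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastAccessed.entrySet()) {
            if (now - e.getValue() >= maxInactiveMillis) {
                expired.add(e.getKey());
            }
        }
        return expired.isEmpty() ? null : new ExpireBatchMessage(expired);
    }
}
```

One background pass then produces at most one replication message per cycle instead of one message per expired session.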

- The restarting node scenario is flaky...
  You wait for the GETALLMessage while other nodes send session events. (BAD)
  You can get a session delta before the session exists...
  => I think that as long as the state is not transferred, we must queue those messages from the other cluster nodes.
  - Send all sessions in more than one message, 1000 sessions per message.
    After the complete transfer of active sessions, send a special state transfer message.

- documentation
  Write a new how-to and add a sample config.
  => I have implemented a very fine cluster template and will check it in this weekend.

Your ClusterSessionListener server.xml change is not needed. When the cluster starts, a ClusterSessionListener is created if no other listener is configured.
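
The two ideas for the restart case (queue incoming messages until the state is transferred, and split the full session transfer into packages of 1000) could be sketched roughly as follows. StateTransferGate is a hypothetical name, and the sketch deliberately leaves out timeouts and the actual cluster message types.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.Consumer;

/** Hypothetical gate: holds back messages until the state transfer is complete. */
final class StateTransferGate<M> {
    private final Queue<M> pending = new ArrayDeque<>();
    private final Consumer<M> handler;
    private boolean stateComplete = false;

    StateTransferGate(Consumer<M> handler) { this.handler = handler; }

    /** Called for every message from another node; queues it while state is missing. */
    synchronized void onMessage(M msg) {
        if (stateComplete) {
            handler.accept(msg);
        } else {
            pending.add(msg);
        }
    }

    /** Called when the special "state transfer complete" message arrives. */
    synchronized void onStateTransferComplete() {
        stateComplete = true;
        M msg;
        while ((msg = pending.poll()) != null) {
            handler.accept(msg); // replay in arrival order
        }
    }

    /** Sender side: split all sessions into packages of at most 'size' entries. */
    static <T> List<List<T>> chunk(List<T> sessions, int size) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < sessions.size(); i += size) {
            out.add(new ArrayList<>(sessions.subList(i, Math.min(i + size, sessions.size()))));
        }
        return out;
    }
}
```

With this shape, a session delta that arrives before its session exists is simply replayed after the last package has been applied.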

Peter

Filip
Peter Rossbach wrote:
Yes, I have changed a lot and it is time to test and stabilize the code.
   See to-do.txt for more...
The current cluster code with the 5.5.9 fix pack works very well. I tested the fix under very high load last week.
Peter
- Great that you have also started to look inside the code.
Filip Hanik - Dev Lists wrote:
I am going through the cluster code right now and will be adding fixes along the way. I think the development of this code has focused more on features than stability, so I would like to ask that for the next period, let's focus on the stability and get this beast back in shape again.
Filip
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
---------------------------------------------------------------------