Fellow ServiceMix users and developers,

I am experiencing some rather odd behavior when using the JMS Flow in a
load-balanced environment, that is, one where HTTP traffic is distributed
round-robin between two HttpSoapConnectors running in separate ServiceMix
containers. I'll illustrate the course of events below.

1. Start up two separate ActiveMQ 3.2.1 instances on two separate hosts,
each pointing to the other via the <networkConnector> directive in the
conf/activemq.xml file.
2. Start up two separate ServiceMix 2.x instances on two separate hosts, each
pointing to both of the ActiveMQ instances started in (1) via
flowName="jms?jmsURL=reliable(tcp://activeMQ_A, tcp://activeMQ_B)".
3. Deploy the same service assembly to each ServiceMix installation. The SA
will create an HttpSoapConnector and SaajComponent.
4. From a SOAP+HTTP client, connect to
http://serviceMix_A/HttpSoapConnector, which will take the SOAP message and
send it to the SaajComponent. The SaajComponent will invoke a specified web
service and then return the result back to the HttpSoapConnector, and
finally back to the SOAP+HTTP client. The expected response is received.
5. Execute the same flow as in (4), but point to
http://serviceMix_B/HttpSoapConnector. The expected response is received.
6. Kill serviceMix_A, then re-execute (5). The expected response may be
returned, OR the call may hang with no visible errors.

Things that I have noticed during steps (4) and (5):
a. The debug output indicates that service execution is not confined to the
container hosting the HttpSoapConnector being posted to. That is, any node
of the cluster might execute the SaajComponent invocation: even though the
SaajComponent service is deployed locally in the same container as the
invoked HttpSoapConnector, a remote instance of the SaajComponent hosted by
another node of the cluster might be invoked. (Please note that the
ServiceMix instances have different names in their servicemix.xml files.)
Which SaajComponent instance is invoked seems to depend on which instance
was invoked first (i.e. which ServiceMix instance first took part in an
exchange).

b. When one of the two nodes is shut down and the other node's
HttpSoapConnector is invoked, the SaajComponent local to that active node
may never be called; the JMS Flow seems to attempt an invocation of the
SaajComponent hosted by the downed node. This can be verified by calling the
same service multiple times: if the first invocation results in a timeout
(because the cluster is trying to use the downed node) and the downed node
is then restarted, subsequent calls will succeed.

c. If you kill one of the ActiveMQ instances once you are in the hung state,
the flow will pick up correctly once the active ServiceMix node has
failed over to the secondary ActiveMQ node, leaving you with one ActiveMQ
instance and one ServiceMix instance. Once you have killed and restarted
the ActiveMQ instance, the problem may be very hard to recreate; it only
seems to happen when the environment is brought up from scratch.

This behavior makes no sense to me. I would expect that if a node of a
cluster was shut down BEFORE any calls were made, all calls to the cluster
would be honored by the remaining active nodes. No exceptions are thrown by
the bus, and none are thrown by ActiveMQ. As it stands (with either an
Oracle data source or Derby behind ActiveMQ), the JMS Flow results in a
more fragile environment than simply running the SEDA flow with HTTP load
balancing provided by Apache. I cannot imagine that this behavior is by
design, but I don't have any more information to help diagnose the error.
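To make that expectation concrete, here is a rough Java sketch of the
behavior I would have expected: a registry that only offers endpoints on
containers it has heard from recently. Everything in it is hypothetical
(the Endpoint record, the heartbeat() call, the timeout value); it is NOT
the ServiceMix 2.x API and not how the JMS Flow is actually implemented,
just an illustration of "route only to live nodes".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical sketch, not ServiceMix code: remembers when each container was
 * last heard from and only returns endpoints hosted on containers that are
 * still considered live. Heartbeat mechanism and timeout are assumptions.
 */
public class LivenessAwareRegistry {
    /** A service endpoint plus the name of the container hosting it. */
    public record Endpoint(String service, String container) {}

    private final long timeoutMillis;
    private final Map<String, Long> lastHeartbeat = new HashMap<>();
    private final List<Endpoint> endpoints = new ArrayList<>();

    public LivenessAwareRegistry(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    public void register(Endpoint e) {
        endpoints.add(e);
    }

    /** Record that a container was seen alive at the given time (ms). */
    public void heartbeat(String container, long nowMillis) {
        lastHeartbeat.put(container, nowMillis);
    }

    /** Endpoints for a service whose container sent a heartbeat within the timeout. */
    public List<Endpoint> liveEndpoints(String service, long nowMillis) {
        List<Endpoint> live = new ArrayList<>();
        for (Endpoint e : endpoints) {
            Long seen = lastHeartbeat.get(e.container());
            boolean fresh = seen != null && nowMillis - seen <= timeoutMillis;
            if (e.service().equals(service) && fresh) {
                live.add(e);
            }
        }
        return live;
    }
}
```

With a 5-second timeout, a container that has been killed drops out of the
candidate list once its last heartbeat is older than that, so a call
arriving on serviceMix_B would only see the SaajComponent on serviceMix_B
instead of hanging on the downed node.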

Moreover, I am confused by the execution of services on a remote node when
the local node itself has a copy of the service. I would think that the
local copy would be selected first. Is there some way to enforce a
"LOCAL_FIRST" policy, or is it not something that can be easily done (I know
of two policies: RandomChoice and FirstChoice)? Does the bus know whether a
certain service is local or remote? I ask because with the current
behavior (which sometimes selects remote over local), you cannot
effectively load balance across multiple ServiceMix nodes when they are
deployed centrally; service execution will take place on, at best, a subset
of nodes (and that subset might be only one node).
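For clarity, this is roughly what I mean by "LOCAL_FIRST", next to what I
assume a FirstChoice policy does (take whichever candidate is listed
first). The Endpoint record, the container names and the method names are
hypothetical and are not the ServiceMix API; it only sketches the
selection rule.

```java
import java.util.List;
import java.util.Optional;

/**
 * Hypothetical sketch of a "LOCAL_FIRST" endpoint selection rule.
 * Not the ServiceMix chooser API; names are illustrative only.
 */
public class LocalFirstPolicy {
    /** A service endpoint plus the name of the container hosting it. */
    public record Endpoint(String service, String container) {}

    private final String localContainer;

    public LocalFirstPolicy(String localContainer) {
        this.localContainer = localContainer;
    }

    /** Assumed FirstChoice-style behavior: take whichever candidate is listed first. */
    public static Optional<Endpoint> firstChoice(List<Endpoint> candidates) {
        return candidates.stream().findFirst();
    }

    /** Prefer an endpoint in this container; otherwise fall back to the first candidate. */
    public Optional<Endpoint> choose(List<Endpoint> candidates) {
        return candidates.stream()
                .filter(e -> e.container().equals(localContainer))
                .findFirst()
                .or(() -> firstChoice(candidates));
    }
}
```

If the candidate list happens to be ordered [serviceMix_A, serviceMix_B],
firstChoice returns the endpoint on serviceMix_A even for a request that
arrived on serviceMix_B, which would be consistent with what I see in (a).
The LOCAL_FIRST rule keeps that request on serviceMix_B and only goes
remote when no local endpoint exists.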

In summary, I have two questions:
[1] How can you control service selection and execution in a cluster?
[2] Why is the JMS Flow so fragile that various parts have to be bounced
when a node dies? Put another way, why can't I add and remove ServiceMix
nodes without having to worry about restarting one or more of the ActiveMQ
nodes?

regards,
/jonathan