Hi,
I have a Problem with the stability of the connection from an AgentXSubAgent
to the MasterAgent. The SubAgent is used in a System running 24/7 (RHEL
kernel 2.6.18, NET-SNMP version: 5.3.2.2). Upon disconnection (detected via a
TransportStateListener) a reconnection is Performed. However due to a race
condition within snmp4j the connection is not properly re-established leaving
the system with 2 connections (one with an established AgentX session and one
"plain TCP connection"). However the connection with the established AgentX
Session is not used, because the DefaultTcpTransportMapping only memorizes
the "last" connection to a certain address.
Details:
The following sequence of events can be observed from the log files (snmp4j
logging has been activated and set to DEBUG level for analyzing this issue):
At some point in time the "request=>response" pattern gets "out of sync" and
we experience a pattern in the form "request1, request2 => response1,
response2". Right after an occurrence of such a pattern the connection seems
to be dropped by the peer. While re-establishing the connection, snmp4j is
running into race-condition preventing a proper re-connection.
For the sake of readability the following log output is manually filtered to
contain only the relevant pieces. Lines beginning with "///" describe the
relevant pieces from the removed log output.
02:51:04,807 DEBUG [DefaultTCPTransportMapping_0.0.0.0/0]
org.snmp4j.transport.DefaultTcpTransportMapping: Key is readable
/// Reading message containing GetPDU with transactionID=13485
02:51:04,807 DEBUG [DefaultTCPTransportMapping_0.0.0.0/0]
org.snmp4j.transport.DefaultTcpTransportMapping: Key is readable
/// Reading message containing GetPDU with transactionID=13487
02:51:04,811 DEBUG [DefaultTCPTransportMapping_0.0.0.0/0]
org.snmp4j.transport.DefaultTcpTransportMapping: Key is writable
/// Sending message containing response with transactionID=13485
02:51:04,811 DEBUG [DefaultTCPTransportMapping_0.0.0.0/0]
org.snmp4j.transport.DefaultTcpTransportMapping: Key is readable
/// Reading message containing GetPDU with transactionID=13489
02:51:04,812 DEBUG [DefaultTCPTransportMapping_0.0.0.0/0]
org.snmp4j.transport.DefaultTcpTransportMapping: Key is writable
/// Obviously trying to send response for transactionID=13487
02:51:04,812 WARN [DefaultTCPTransportMapping_0.0.0.0/0]
org.snmp4j.transport.DefaultTcpTransportMapping: java.io.IOException: Broken
pipe
/// The broken pipe oviously happened when trying to send data.
/// Now the reconnector is kicking in
02:51:04,813 DEBUG [SubAgentXImpl periodic reconnector]
org.snmp4j.transport.DefaultTcpTransportMapping: Socket for address
'127.0.0.1/705' is closed, opening it...
/// Obviously at the same time snmp4j tries to send the last pending
outgoing message (response for transactionID=13489)
02:51:04,813 DEBUG [AgentXSubAgent.0]
org.snmp4j.transport.DefaultTcpTransportMapping: Socket for address
'127.0.0.1/705' is closed, opening it...
///Then the connection for the first opened socket succeeds (local port
56565)
02:51:04,813 DEBUG [DefaultTCPTransportMapping_0.0.0.0/0]
org.snmp4j.transport.DefaultTcpTransportMapping: Key is connectable
///Requesting to write AgentX-Open-PDU
02:51:04,813 DEBUG [DefaultTCPTransportMapping_0.0.0.0/0]
org.snmp4j.transport.DefaultTcpTransportMapping: Key is writable
///Write AgentX-Open-PDU
02:51:04,814 DEBUG [DefaultTCPTransportMapping_0.0.0.0/0]
org.snmp4j.transport.DefaultTcpTransportMapping: Key is readable
///Reading response => AgentX Session opened successfully.
/// Then the connection for the second opened socket succeeds (local port
56566)
02:51:04,814 DEBUG [DefaultTCPTransportMapping_0.0.0.0/0]
org.snmp4j.transport.DefaultTcpTransportMapping: Key is connectable
///This socket now overwrites the first socket (on which the successfully
opened session is relying) within the socket map in
DefaultTcpTransportMapping
///Now the last pending outgoing message from the previous session is sent
on the second opened socket
///As well as the first AgentX-Register-PDU for the new session (again via
the second opened socket).
///The following log entry clearly shows that snmp4j is using the second
opened socket (port 56566) for registering instead of the one on which the
session was opened (port 56565)
02:51:04,815 DEBUG [SubAgentXImpl periodic reconnector]
org.snmp4j.transport.DefaultTcpTransportMapping: Looking up connection for
destination '127.0.0.1/705' returned:
SocketEntry[peerAddress=127.0.0.1/705,socket=Socket[addr=/127.0.0.1,port=705,
localport=56566],lastUse=Tue Mar 15 02:51:04 UTC 2011]
>From there on all outgoing messages (for registering MOs) are sent via the
second socket where NO Session has been established. Therefore the
MasterAgent only responds with 257 (no session open).
So we have several problems here:
- A response PDU for a "broken session" opens a new socket, that will
"overwrite" the socket of a newly created session.
- A session does not have an associated socket but only an associated
address.
- The TransportMapping does not know anything about the session to which a
packet belongs and therefore should NOT be allowed to choose the socket on
which to send.
I clearly see, that it will definitely not be a minor patch to fix these
problems resulting in the race condition described above. However I am now
trying to defer the reconnection process by 5 seconds. This is however only a
workaround. This will allow any pending outgoing messages to be sent
(possibly opening a new socket). After 5 seconds all pending tasks should be
finished and a new socket connection may already be established which will
then in turn be used to open the new session.
Regarding the loss of the connection: EACH connection loss follows the same
pattern. The "request1 => response1, request2 => response2" pattern gets out
of sync ("request1, request2 => response1, response2") and then obviously the
MasterAgent closes the connection. This however (without having read the
AgentX RFC) appears to me as a valid scenario that is not correctly handled
by the MasterAgent. Does anybody of you happen to know if such a sequence of
requests and responses is permissible in the AgentX Protocol?
- If yes, then the connection loss is obviously triggered by a bug in
NET-SNMP.
- If not, then snmp4j MUST not read a message from the socket before the
response for the last read message is still pending.
Best regards
Stephan
_______________________________________________
SNMP4J mailing list
[email protected]
http://lists.agentpp.org/mailman/listinfo/snmp4j