Interestingly, your point about an overloaded network segment possibly
causing dropped packets is a likely candidate: the network is certainly more
heavily utilised at that site than at any other. As we are not in control of
any of the platform configurations, network schematics etc., it is beyond our
scope to do anything about it.
I am adding extra tracing to four additional events, logging the socket
states and error codes in the OnSessionAvailable, OnDebugAvailable,
OnClientCreate and OnSessionClosed events.
Previously we were not connecting any event handlers to these, so we were
unable to log the state. Analysis by the prime contractor at the customer's
site appeared to suggest that an FD_ACCEPT message was being processed, but
with an error code: they reported that, according to a network analysis
trace, the socket initialization was started correctly, yet the log entries
normally written in OnClientConnect did not appear.
I looked through the source and can see in the TriggerSessionAvailable
handler the line:
    if Error <> 0 then Exit;
This is done before the construction of a 'client socket' object with which
to handle the connection, and also prior to the point where the
OnClientConnect method is called.
So I am guessing an error is being passed in the LParam of the message.
Hopefully, by attaching the OnSessionAvailable event, we might be able to
capture what this error is and then be able to understand why this site has
a particular problem.
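
For anyone reading the extra trace later, here is a minimal sketch of how the
event and error code are packed into the lParam of a WSAAsyncSelect
notification (event in the low word, error in the high word, as the
WSAGETSELECTEVENT and WSAGETSELECTERROR macros extract them). It is written
in Python purely for illustration and is not ICS code; describe_async_message
and the sample value are hypothetical.

```python
# Winsock event bits and a few error codes, as defined in winsock2.h.
FD_EVENTS = {
    0x01: "FD_READ", 0x02: "FD_WRITE", 0x04: "FD_OOB",
    0x08: "FD_ACCEPT", 0x10: "FD_CONNECT", 0x20: "FD_CLOSE",
}
WSA_ERRORS = {
    10024: "WSAEMFILE", 10050: "WSAENETDOWN", 10053: "WSAECONNABORTED",
    10054: "WSAECONNRESET", 10055: "WSAENOBUFS",
}


def wsa_get_select_event(lparam):
    """Low word of lParam: which FD_xxx event fired."""
    return lparam & 0xFFFF


def wsa_get_select_error(lparam):
    """High word of lParam: 0 on success, otherwise a WSAExxx code."""
    return (lparam >> 16) & 0xFFFF


def describe_async_message(lparam):
    """Hypothetical helper: format one notification for a log line."""
    event = wsa_get_select_event(lparam)
    error = wsa_get_select_error(lparam)
    event_name = FD_EVENTS.get(event, "event 0x%04X" % event)
    if error == 0:
        return "%s ok" % event_name
    error_name = WSA_ERRORS.get(error, "unknown")
    return "%s failed: %d (%s)" % (event_name, error, error_name)


if __name__ == "__main__":
    # Made-up example: an FD_ACCEPT notification carrying WSAENOBUFS.
    sample_lparam = (10055 << 16) | 0x08
    print(describe_async_message(sample_lparam))
```

Logging both halves in this form would show whether the accept notification
itself arrives with a non-zero error before any client object is created.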
If and when I receive these additional logs I will post any conclusions
here.
Best regards,
Damien.
-----Original Message-----
From: twsocket-boun...@elists.org [mailto:twsocket-boun...@elists.org] On
Behalf Of Francois PIETTE
Sent: Tuesday, August 03, 2010 5:19 PM
To: ICS support mailing
Subject: Re: [twsocket] TWSocketServer OnConnection event
> I have been asked to investigate a strange issue we are encountering at a
> customer site in Mexico. I am a contractor for a company which supplied
> surveillance and monitoring software based on the ICS component set. The
> software runs fine on other sites with no problems encountered for over 8
> months but on the site in Mexico after a matter of hours or days the
> software (and or server) crashes.
> The servers are all identical HP Blade servers running Windows Server 2003
> vanilla installs. This is true of sites that are functioning and the ones
> in Mexico that are not.
If the software runs fine on several identical systems and fails on a
single system, I would concentrate on what makes that failing site different,
because it has to be different. First check the service pack level. I suggest
then verifying that no malware is intercepting winsock calls; malware does
this to capture traffic. Then I would check that no suspect LSP is installed
on the system. Also check whether some security products are interfering
with winsock: they frequently intercept winsock calls to block some kinds of
traffic. Those security products could be buggy.
> My analysis of the problem to date suggests that an OnClientConnect is
> firing but the passed Client object is incomplete or invalid. The code for
> the OnClientConnect event does not check the ErrorCode and accepts the
> connection but traffic appears not to flow correctly between client and
> server.
I suggest checking the error code and reporting it in the logfile for
analysis.
> if I run NetStat on the server it appears a windows socket object is left
> in FIN-WAIT 1 or FIN-WAIT2 state. Eventually the system fails as all
> windows socket objects are expended and there is a catastrophic failure of
> the software and/or server.
> the steps that should be taken when an error does occur to ensure that
> the windows sockets are correctly 'cleaned up' and released back to the
> Operating System ?
FIN-WAIT-1 and FIN-WAIT-2 mean the orderly shutdown sequence is occurring
but the remote site does not answer (have a look here:
http://www.tcpipguide.com/free/t_TCPConnectionTermination-2.htm). An orderly
shutdown is a multi-step sequence between client and server. What is
strange here is that FIN-WAIT-1 and FIN-WAIT-2 are client side states, not
server side. So it is possible that the sockets you see in that state are
NOT the ones failing. Maybe something else is failing (maybe in the same
software), causing those sockets to be in those states and to consume all
available sockets. That would cause trouble in the software when accepting a
new connection, because accepting a new connection means creating a new
socket. So I see the possibility that some other software, or another part
of your software, has an issue with /client/ connection close. This results
in a lot of sockets in the FIN-WAIT-1 or FIN-WAIT-2 state, consuming all
available sockets and making new connection acceptance fail.
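
To check this on the failing server, the output of "netstat -an" can be
tallied by state and the remote ends of the stuck sockets listed. The
following is only a rough sketch in Python, assuming the usual Windows
column layout (Proto, Local Address, Foreign Address, State) and the
FIN_WAIT_1 / FIN_WAIT_2 spelling that Windows netstat prints; the function
names and the sample addresses are made up for illustration.

```python
from collections import Counter


def count_tcp_states(netstat_output):
    """Count TCP sockets per state in `netstat -an` style text."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # TCP rows have at least: proto, local, foreign, state.
        # UDP rows carry no state column and are skipped.
        if len(fields) >= 4 and fields[0].upper() == "TCP":
            counts[fields[3].upper()] += 1
    return counts


def fin_wait_peers(netstat_output):
    """Return (local, foreign) address pairs for sockets in FIN_WAIT_x."""
    peers = []
    for line in netstat_output.splitlines():
        fields = line.split()
        if (len(fields) >= 4 and fields[0].upper() == "TCP"
                and fields[3].upper().startswith("FIN_WAIT")):
            peers.append((fields[1], fields[2]))
    return peers


# Made-up sample in the layout described above (documentation addresses).
SAMPLE = """
  Proto  Local Address          Foreign Address        State
  TCP    0.0.0.0:5000           0.0.0.0:0              LISTENING
  TCP    192.0.2.10:5000        192.0.2.20:1301        ESTABLISHED
  TCP    192.0.2.10:1188        192.0.2.30:8080        FIN_WAIT_1
  TCP    192.0.2.10:1189        192.0.2.30:8080        FIN_WAIT_2
  TCP    192.0.2.10:1190        192.0.2.30:8080        FIN_WAIT_2
  UDP    0.0.0.0:161            *:*
"""

if __name__ == "__main__":
    for state, n in sorted(count_tcp_states(SAMPLE).items()):
        print(state, n)
    for local, foreign in fin_wait_peers(SAMPLE):
        print("stuck:", local, "->", foreign)
```

If the local port of the stuck sockets is not the server's listening port,
that would support the idea that they belong to outgoing (client)
connections rather than to the accepted ones.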
Why could those client connections have problems with their server not
answering? This could be caused by malware sending forged IP packets to
break existing