Tim Peters wrote:
It's starting to look a lot like the Windows bind() implementation is
unreliable, sometimes (but rarely -- hard to provoke) allowing two
sockets to bind to the same (address, port) pair simultaneously,
instead of raising 'Address already in use' for one of them.  Disaster
ensues.

WRT the last version of the code I posted, on another XP Pro SP2
machine (again after playing registry games to boost the number of
ephemeral ports) I eventually saw all of:  hangs during accept(); the
assertion errors I mentioned last time; and mystery "Connection
refused" errors during connect().

The variant of the code below _only_ tries to use port 19999.  If it
can't bind to that on the first try, socktest111() raises an exception
instead of trying again (or trying a different port number).  Ran two
processes.  After about 15 minutes, both died with assert errors at
about the same time (identical, so far as I could tell by eyeball):

Process A:

Traceback (most recent call last):
  File "socktest.py", line 209, in ?
    assert msg == msg2, (msg, msg2, r.getsockname(), w.getsockname())
AssertionError: ('292739', '821744', ('127.0.0.1', 19999), ('127.0.0.1', 3845))

Process B:

Traceback (most recent call last):
  File "socktest.py", line 209, in ?
    assert msg == msg2, (msg, msg2, r.getsockname(), w.getsockname())
AssertionError: ('821744', '292739', ('127.0.0.1', 19999), ('127.0.0.1', 3846))

So it's again the business where each process is recv'ing the random
string intended to be recv'ed by a socket in the other process. Hypothesized timeline:

process A's `a` binds to 19999
process B's `a` binds to 19999 -- according to me, this should be impossible
    in the absence of SO_REUSEADDR (which acts very differently on
    Windows than it does on Linux, BTW -- on Linux this should be impossible
    even in the presence of SO_REUSEADDR; regardless, we're not using
    SO_REUSEADDR here, and the braindead hard-coded

        w.setsockopt(socket.IPPROTO_TCP, 1, 1)

    is actually using the right magic constant for TCP_NODELAY on
    Windows, as it intends).
A and B both listen()
A connect()s, and accidentally gets on B.a's accept queue
B connect()s, and accidentally gets on A.a's accept queue
the rest follows inexorably
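
A check along the following lines could detect the mix-up in the timeline
above from inside one process: if `r` really is the other end of `w`, then
r's peer address must equal w's own address, and vice versa. This is only
a sketch. The names is_own_pair() and make_pair() are made up for
illustration, and the demo binds to port 0 (an OS-chosen port) rather than
19999 so that it can run anywhere.

```python
import socket

def is_own_pair(r, w):
    """True if accepted socket r is connected to our own client socket w."""
    return (r.getpeername() == w.getsockname() and
            r.getsockname() == w.getpeername())

def make_pair(host="127.0.0.1"):
    # Hypothetical helper: the same bind/listen/connect/accept dance as
    # socktest111(), but on a port chosen by the OS (port 0).
    a = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    w = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    a.bind((host, 0))
    a.listen(1)
    w.connect(a.getsockname())
    r, addr = a.accept()
    a.close()
    return r, w

if __name__ == "__main__":
    r, w = make_pair()
    print(is_own_pair(r, w))
    r.close()
    w.close()
```

In the failing runs shown above, such a check would presumably have come
out false, since each process's `r` was paired with the other process's `w`.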




This is what I'm experiencing as well.
I can narrow it down a bit: I *always* experience one out of two
erroneous behaviours, as described below.

I tried to make an even simpler test situation, without binding
sockets 'r' and 'w' to each other in the same process. I try to
reproduce the problem in a 'standard' socket use case, where a client
in one process connects to a server in another process.

The following two scripts act as a server and a client.

#***********************
# sock_server_reader.py
#***********************
import socket

a = socket.socket (socket.AF_INET, socket.SOCK_STREAM)

a.bind(("127.0.0.1", 19999))
print a.getsockname()  # assigned (host, port) pair

a.listen(1)

print "a accepting:"
r, addr = a.accept()  # r becomes asyncore's (self.)socket
print "a accepted: "
print ' ' + str(r.getsockname()) + ', peer=' + str(r.getpeername())

a.close()

msg = r.recv(100)
print 'msg recieved:', msg


#***********************
# sock_client_writer.py
#***********************
import socket, random

w = socket.socket (socket.AF_INET, socket.SOCK_STREAM)
w.setsockopt(socket.IPPROTO_TCP, 1, 1)  # option 1 is TCP_NODELAY

print 'w connecting:'
w.connect(('127.0.0.1', 19999))
print 'w connected:'
print w.getsockname()
print ' ' + str(w.getsockname()) + ', peer=' + str(w.getpeername())
msg = str(random.randrange(1000000))
print 'sending msg: ', msg
w.send(msg)




There are two possible outcomes [a) and b)] of running two instances
of this client/server pair (that is, 4 processes in total like the
following).
(Numbers 1 to 4 are steps executed in chronological order.)

1) python -i sock_server_reader.py
The server prints:
        ('127.0.0.1', 19999)
        a accepting:
and waits for a connection

2) python -i sock_client_writer.py
The client prints:
        w connecting:
        w connected:
        ('127.0.0.1', 3774)
         ('127.0.0.1', 3774), peer=('127.0.0.1', 19999)
        sending msg:  903848
        >>>

and the server now accepts the connection and prints:
        a accepted:
         ('127.0.0.1', 19999), peer=('127.0.0.1', 3774)
        msg recieved: 903848
        >>>

This is as it should be. Then let's try to set up a second
client/server pair on the same port (19999). The expected outcome is
that the bind() call in sock_server_reader.py fails with
socket.error: (10048, 'Address already in use').

3) python -i sock_server_reader.py
The server prints:
        ('127.0.0.1', 19999)
        a accepting:

The problem occurs already here: bind() is allowed to bind to a port
that is in use, in this case by the accepted socket 'r' of the first
server.
[also on other windows ? Mikkel: yes. Diku:???]

4) python -i sock_client_writer.py
Now one of two things happens:

a) The client prints:
        w connecting:
        Traceback (most recent call last):
          File "c:\pyscripts\sock_client_writer.py", line 7, in ?
            w.connect(('127.0.0.1', 19999))
          File "<string>", line 1, in connect
        socket.error: (10061, 'Connection refused')
        >>>
   The server waits on the call to accept(), still waiting for a
connection. (This is the blocking behaviour I reported in my first
mail, experienced when running two zope instances. The socket error
was swallowed by the unconditional except clause).

b) The client connects to the server:
        w connecting:
        w connected:
        ('127.0.0.1', 3865)
         ('127.0.0.1', 3865), peer=('127.0.0.1', 19999)
        sending msg:  119105
        >>>

and the server now accepts the connection and prints:
        a accepted:
         ('127.0.0.1', 19999), peer=('127.0.0.1', 3865)
        msg recieved: 119105
        >>>

The second set of client/server processes is now connected on the
same port as the first set of client/server processes. In a port
scanner the port now belongs to the second server process [3)].


I always get one of these two outcomes (a or b); I never
see bind() raising socket.error: (10048, 'Address already in use').

It is important to realize that both of these outcomes are errors.

I tried the same procedure on a Linux system, and there the bind() in
step 3) always fails with 'Address already in use'.


If case a) occurred, where w.connect() raises socket.error: (10061,
'Connection refused'), then trying to run a third client/server pair
makes the bind() call raise (10048, 'Address already in use'). The
'a' socket from the second pair of processes is not closed in this
case, but is still blocked in accept().

In my case bind() always raises (10048, 'Address already in use') when
there is an open server socket like 'a' bound to the same port.

To summarize:
Closing a server socket bound to a given port allows another server
socket to bind to the same port, even when there are open client
sockets bound to the port.
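
The summary can be checked from a single process with something like the
sketch below. It is not one of the scripts I ran above; the function name
is made up, and it uses port 0 (an OS-chosen port) instead of 19999 so it
does not collide with anything else. It returns None if the second bind()
was allowed, and the socket.error otherwise.

```python
import socket

def rebind_after_listener_close(host="127.0.0.1"):
    """Try to bind a new socket to a port whose listener was closed
    while an accepted connection on that port is still open."""
    a = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    a.bind((host, 0))              # port 0: let the OS pick a free port
    port = a.getsockname()[1]
    a.listen(1)

    w = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    w.connect((host, port))
    r, addr = a.accept()
    a.close()                      # listener gone; r and w stay open

    b = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        try:
            b.bind((host, port))   # no SO_REUSEADDR set
            result = None          # bind() was allowed
        except socket.error as exc:
            result = exc           # bind() was refused
    finally:
        for s in (b, r, w):
            s.close()
    return result

if __name__ == "__main__":
    print(rebind_after_listener_close())
```

Going by the runs described above, I would expect None on the affected
Windows machines and an 'Address already in use' error on Linux, but I
have not run this exact sketch on either.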





Note that because this never tries a port number other than 19999, it
can't be a bulletproof workaround simply to hold on to the `a` socket.
If the hypothesized timeline above is right, bind() can't be trusted
on Windows in any situation where two processes may try to bind to the
same hostname:port pair at the same time.  Holding on to `a`, and
cycling through port numbers when bind() failed, would still
potentially leave two processes trying to bind to the same port number
simultaneously (just a port other than 19999).
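
For concreteness, "cycling through port numbers when bind() failed" would
look roughly like the sketch below. The function name and the list of
ports are made up for illustration. Note that it relies on bind() raising
for a port that is already taken, which is exactly the guarantee that
appears to be broken here, so it illustrates the race rather than fixing it.

```python
import socket

def bind_first_free(host, ports):
    """Return (socket, port) for the first port in `ports` that bind()
    accepts; raise RuntimeError if bind() fails for all of them."""
    for port in ports:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.bind((host, port))
        except socket.error:
            s.close()              # port refused, try the next one
            continue
        return s, s.getsockname()[1]
    raise RuntimeError("could not bind to any of the given ports")

if __name__ == "__main__":
    s, port = bind_first_free("127.0.0.1", range(19999, 20010))
    print(port)
    s.close()
```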


It would not be enough to keep a reference to 'a'. It would have to be
kept open as well. And maybe that is not a problem, since we only
accept() once - only one 'w' client socket would be able to be
accepted. Normally the use case for closing the server socket is to
disallow more connections than those already accepted.
(But I'm not so experienced with sockets, I might be wrong.)


Ick:  this happens under Pythons 2.3.5 (MSVC 6) and 2.4.1 (MSVC 7.1),
so if it is -- as is looking more and more likely -- an error in MS's
socket implementation, it isn't avoided by switching to a newer MS C
library.

Frankly, I don't see a sane way to worm around this -- it's difficult
for application code to worm around what smells like a missing
critical section in system code.

Using the simpler socket dance from the ZODB 3.4 code, I haven't yet
seen an instance of the assert failure, or a hang.  However, let two
processes run that long enough simultaneously, and it always (so far)
eventually fails with

    socket.error: (10048, 'Address already in use')

in the w.connect() call, and despite that Windows picks the port numbers here!

That is exactly what I feared could happen. As shown in my example
above, the other thing that might happen is that the port is 'taken
over' by the other process.


While that also smells to heaven of a missing critical section in the
Windows socket implementation, an exception is much easier to live
with / worm around.  Alas, we don't have the MS source code, and I
don't have time to try disassembling / reverse-engineering the opcodes
(what EULA <wink>?), so best I can do is run this for many more hours
to try to increase confidence that an exception is the worst that can
occur under the ZODB 3.4 spelling.
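
If an exception really is the worst case, worming around it could be as
simple as retrying. The sketch below is NOT the ZODB 3.4 code; it only
assumes, as described above, that the OS picks the port (bind to port 0)
and that a failure shows up as socket.error. The function name and the
`attempts` parameter are made up for illustration.

```python
import socket

def connected_pair(host="127.0.0.1", attempts=10):
    """Return a connected (r, w) socket pair on an OS-chosen port,
    retrying the whole dance if any socket call raises socket.error."""
    last_error = None
    for _ in range(attempts):
        a = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        w = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            a.bind((host, 0))            # port 0: the OS picks the port
            a.listen(1)
            w.connect(a.getsockname())
            r, addr = a.accept()
            return r, w
        except socket.error as exc:
            w.close()                    # give up on this attempt
            last_error = exc
        finally:
            a.close()                    # the listener is never kept
    if last_error is None:
        raise RuntimeError("attempts must be at least 1")
    raise last_error

if __name__ == "__main__":
    r, w = connected_pair()
    w.sendall(b"ping")
    print(r.recv(100))
    r.close()
    w.close()
```

This only helps if the failure mode is an exception. It does nothing
about the case where two processes end up on each other's accept queue.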

Here's full code for the "only try port 19999" version:

import socket, errno
import time, random
def socktest111():
    """Raise an exception if we can't get 19999.
    """

    a = socket.socket (socket.AF_INET, socket.SOCK_STREAM)
    w = socket.socket (socket.AF_INET, socket.SOCK_STREAM)

    # set TCP_NODELAY to true to avoid buffering
    w.setsockopt(socket.IPPROTO_TCP, 1, 1)

    # tricky: get a pair of connected sockets
    host = '127.0.0.1'
    port = 19999

    try:
        a.bind((host, port))
    except:
        raise RuntimeError
    else:
        print 'b',

    a.listen (1)
    w.setblocking (0)
    try:
        w.connect ((host, port))
    except:
        pass
    print 'c',
    r, addr = a.accept()
    print 'a',
    a.close()
    print 'c',
    w.setblocking (1)

    return (r, w)

sofar = []
try:
   while 1:
       try:
           stuff = socktest111()
       except RuntimeError:
           print 'x',
           time.sleep(random.random()/10)
           continue
       sofar.append(stuff)
       time.sleep(random.random()/10)
       if len(sofar) == 50:
           tup = sofar.pop(0)
           r, w = tup
           msg = str(random.randrange(1000000))
           w.send(msg)
           msg2 = r.recv(100)
           assert msg == msg2, (msg, msg2, r.getsockname(), w.getsockname())
           for s in tup:
               s.close()
except KeyboardInterrupt:
   for tup in sofar:
       for s in tup:
           s.close()
_______________________________________________
Zope maillist  -  Zope@zope.org
http://mail.zope.org/mailman/listinfo/zope
**   No cross posts or HTML encoding!  **
(Related lists - http://mail.zope.org/mailman/listinfo/zope-announce
 http://mail.zope.org/mailman/listinfo/zope-dev )