Hi Pau, The system-wide critical section is currently implemented using a win32 Event which, as you observed, has the possibility of resulting in a deadlock in the following situation: 1) Process A takes the Event 2) Process B tries to take the Event and blocks 3) Process A aborts within the critical section (due to an assertion being raised) 4) Since Process B has opened the Event, the OS will not clean up the Event. 5) Process B and any subsequent process will now block forever for the Event.
As I mentioned in the previous mail, if the critical section were to be implemented using a Mutex instead, then in step 5, Process B would be able to enter the critical section with a return code of WAIT_ABANDONED from WaitForSingleObject. (Or at least that's what I read from MSDN) Note: If Process A aborted due to some exhaustion of resources, then Process B would likely hit the same assertion too. The question is how to convert the Event to a Mutex and yet not break compatibility with existing applications using older versions of the library. On Wed, Feb 13, 2013 at 3:28 PM, Pau <[email protected]> wrote: > Hi, > > I am back with the asserts happening inside a critical section in > signaler.cpp. > The problem still is that in signale.cpp make_fdpair(..) creates > system-wide critical section and does a number of things that can generate > a wsa_assert() or win_assert() before releasing the session. > > I have seen that in the trunk someone has added a CloseHandle(sync) at the > end of the function, I do not know if it had something related with this > but I understand that the problem is still there. wsa_assert() and > wsa_windows() end up in RaiseException (0x40000015, > EXCEPTION_NONCONTINUABLE, 1, extra_info) which I understand is a cul de sac > that has no way out to clean up before leaving. > > I guess we need a special assert function to use inside this critical but > I'd like a more documented opinion (Kiu?). > > thanks, > > Pau Ceano > > El 21/01/2013 23:37, KIU Shueng Chuan escribió: > > Hi Pau, a fix for the assertion on connection to port 5905 is in trunk > branch. > > I think the dangling critical section possibility could be fixed by > changing the Event to a Mutex. When an assertion occurs the mutex would > just be abandoned. However this change will cause backward compatibility > issues with older versions. > On Jan 22, 2013 2:04 AM, "Pieter Hintjens" <[email protected]> wrote: > >> Hi Pau, >> >> So there are two different problems here, one is that we're hitting a >> socket limit on WXP, and the other is that the asserts are happening >> inside a critical section. >> >> I don't think we can fix the first one easily but we can presumably >> assert in a smarter way. Do you want to try making a patch for this? >> >> -Pieter >> >> On Mon, Jan 21, 2013 at 6:23 PM, Pau <[email protected]> wrote: >> > >> > Hi, >> > >> > >> > I am using (not yet in production) ZMQ on Windows and I have found what >> > I think is a big problem for Windows users. >> > We use WXP and W7 and Visual C++ different versions. ZMQ version 3.2.0 >> > (as far as I see the same problem happens in 3.2.2) >> > >> > I do not fully understand ZMQ internals but I've seen that every time a >> > socket is created the function make_fdpair(..) is called and in >> > signaler.cpp(line244) a system event "zmq-signaler-port-sync" is >> created. >> > This event is used as a system-wide critical section and, so all >> > applications that try to create an event will WaitForSingleObject (sync, >> > INFINITE) until SetEvent (...) is called. >> > The problem is that the code between: >> > HANDLE sync = CreateEvent (NULL, FALSE, TRUE, TEXT >> > ("zmq-signaler-port-sync")); >> > and >> > SetEvent (sync); >> > is full of wsa_asserts(..) that will terminate the application if >> > something goes wrong. >> > >> > It is clear that terminating the application not leaving the system-wide >> > critical section is a bad idea because all applications in the system >> > will hang and you will have to stop all them to start again. >> > I understand that no errors should happen but anyway to escape from the >> > error is not a good idea in this case. >> > >> > I do not know all possible reasons to generate a fatal wsa_assert(..) >> > but there is at least one: >> > >> > I have seen that in XP it is possible that line 301 rc = connect (*w_, >> > (sockaddr *) &addr, sizeof (addr)); returns an error when a socket tries >> > to connect to 5905 and this has happened many times. >> > Windows uses port numbers starting near 1400 and XP has a limit at 5000. >> > In W7 this does not look as a problem because maximum is 65000 >> > It sounds as if the number was big enough but apart from the fact that >> > ZMQ uses a big number of connections (at least in my tests) I have >> > experienced that Windows jumps port numbers by 7, so 5000 happens >> > sometimes with catastrophic consequences. >> > >> > best, >> > >> > Pau Ceano >> > _______________________________________________ >> > zeromq-dev mailing list >> > [email protected] >> > http://lists.zeromq.org/mailman/listinfo/zeromq-dev >> _______________________________________________ >> zeromq-dev mailing list >> [email protected] >> http://lists.zeromq.org/mailman/listinfo/zeromq-dev >> > > > _______________________________________________ > zeromq-dev mailing > [email protected]http://lists.zeromq.org/mailman/listinfo/zeromq-dev > > > > _______________________________________________ > zeromq-dev mailing list > [email protected] > http://lists.zeromq.org/mailman/listinfo/zeromq-dev > >
_______________________________________________ zeromq-dev mailing list [email protected] http://lists.zeromq.org/mailman/listinfo/zeromq-dev
