This works before forking... close all sockets directly, don't use ZMQ close in parent or children:
https://zeromq.jira.com/browse/LIBZMQ-441

On Wed, Oct 2, 2013 at 4:46 AM, Selim Ciraci <[email protected]> wrote:

> Hi,
>
> The only solution we could find to the leaking sockets problem is to
> destroy the parent context before fork. Then, we re-initialize the parent
> context after fork. Sometimes the context initialization fails at the
> parent; somehow the router-dealer connections are not established. We are
> looking at this problem now.
>
> Best,
> Selim
>
>
> On Wed, Sep 18, 2013 at 11:03 PM, Selim Ciraci <[email protected]> wrote:
>
>> Hi,
>>
>> Here is some more info on the error:
>>
>> After forking a child-child-child...child process (whose parents are
>> terminated cleanly using zmq_term), zmq_connect fails. For instance:
>> pid 1 forks pid 2
>> pid 2 connects to server and does some work.
>> pid 2 asks pid 1 to terminate, pid 1 terminates (zmq_term() is called),
>> pid 2 forks pid 3.
>> pid 3 connects to server and does some work
>> pid 3 asks pid 2 to terminate, pid 3 terminates.
>> ....
>> pid 10 forks pid 11
>> pid 11 tries to connect to the server, zmq_connect fails with EINVAL.
>> Further trace on the error shows that the call to getaddrinfo from
>> tcp_address::resolve_hostname() fails.
>>
>> Our code passes tcp://localhost:5555 as the address to connect (the value
>> does not change, it is a constant string). The connection works on all
>> child processes, until we reach a certain depth. At that point getaddrinfo
>> on localhost fails with "no address associated with that name". This is
>> kind of weird. I don't know what might cause this. In fact, I verified the
>> parameters passed to getaddrinfo and all seems ok.
>>
>> On a side note, I think the sockets inherited from the parents are not
>> closed. I can see the sockets in /proc/<pid>/fd (or fds, I don't remember).
>> Moreover, I see that the server (with the router socket) removes the pipes
>> associated with dead parent ids when the child-child-child..-child process
>> terminates successfully (i.e., when it calls zmq_term). For the error in
>> getaddrinfo, I think the system is running out of fds so an fd operation is
>> failing. I might be wrong though. Any comments?
>>
>> Any help is greatly appreciated! The code I'm using is around 250,000 lines
>> of code so it is a bit hard to get a test case. But I'm working on it.
>>
>> Best,
>> Selim Ciraci
>>
>>
>> On Mon, Sep 16, 2013 at 4:15 PM, Matt Connolly <[email protected]> wrote:
>>
>>> There are two types of sockets used by zeromq as far as I understand:
>>> external connections and internal pipes used to communicate between the io
>>> threads and the host application.
>>>
>>> My patch for zmq_term replaces all of the internal pipes with new ones.
>>> This allows the termination process to complete without affecting the pipes
>>> that were inherited from the parent process, which caused asserts in the
>>> parent.
>>>
>>> Returning EINTR was intended so that terminating the context would
>>> behave the same as if the process received a signal. (It could be receiving
>>> signals for other reasons, e.g. a usr signal.)
>>>
>>> If there are connected zmq sockets (to some other machine for example)
>>> then those sockets would also be inherited but I thought they would have
>>> been closed correctly by the termination process. This may not be working
>>> right and activity on these sockets between fork and terminate in the child
>>> may interfere with the parent context's ability to use these sockets.
>>> Perhaps these sockets are not actually being closed properly and causing
>>> this problem.
>>>
>>> I'll take a closer look later in the week and see...
>>>
>>>
>>> Regards,
>>> Matt.
>>>
>>> On 17 Sep 2013, at 8:22 am, Selim Ciraci <[email protected]> wrote:
>>>
>>> Hi Matt,
>>>
>>> Another thing is, sorry if I'm wrong, but zmq_term in the child always
>>> returns EINTR. This is because most of the socket operations return EINTR
>>> when pid != getpid(). With your patch signaler will create a new eventfd
>>> (correct me if I'm wrong) and then return. It is up to the reaper thread to
>>> close the sockets, right? But since most operations just return EINTR, I
>>> wonder if the sockets are really closed after the fork.
>>>
>>> Best,
>>> Selim Ciraci
>>>
>>>
>>> On Mon, Sep 16, 2013 at 11:40 AM, Selim Ciraci <[email protected]> wrote:
>>>
>>>> Hi Matt,
>>>>
>>>> It is not an assertion fail. The problem occurs in connections between
>>>> router-dealer sockets. The send function in router.cpp returns no route to
>>>> host because it cannot find the host_id in the outpipes_t. A careful debug
>>>> shows that actually the pipe from dealer to the router has not been
>>>> established. I put a printf in the xidentify_peer method in router.cpp; the
>>>> new client ids are inserted to the outpipes_t in this method as far as I
>>>> know. The aim here is to compare the child process ids with the ids the
>>>> router socket received. The comparison actually showed that some child ids
>>>> went missing (router socket never received them). I must add that the ids
>>>> went missing after a parent process terminates. Though I need further
>>>> testing to prove this.
>>>>
>>>> Any ideas what might be going wrong here? I'm going to try to implement
>>>> a simple test case.
>>>>
>>>> Thanks,
>>>> Selim
>>>>
>>>>
>>>> On Mon, Sep 16, 2013 at 6:13 AM, Matt Connolly <[email protected]> wrote:
>>>>
>>>>> Hi Selim,
>>>>>
>>>>> I don’t have any ideas yet about why the parent would stop sending
>>>>> messages after forking a second child.
>>>>>
>>>>> Is it possible to reproduce this in a simple test case?
>>>>>
>>>>> And when the no route to host error occurs, is that an assertion?
>>>>> If so, can you provide a stack trace?
>>>>>
>>>>> -Matt
>>>>>
>>>>> On 14 Sep 2013, at 6:43 am, Selim Ciraci <[email protected]> wrote:
>>>>>
>>>>> > Hi Matt,
>>>>> >
>>>>> > Thanks for your reply. I actually found out about your patch after
>>>>> > the email. I have updated zmq to head from github and tried with my
>>>>> > program. The parent sockets seem to have been closed. But the problem
>>>>> > is every now and then I get "no route to host" errors in zmq_send.
>>>>> > This usually happens when:
>>>>> > parent forks a child, child calls zmq_term(parent_context), does work
>>>>> > and then terminates (closes its context).
>>>>> > parent in parallel uses parent_context, does work, learns the child
>>>>> > has terminated, forks a new child child2.
>>>>> > child2 calls zmq_term(parent_context), does work and then terminates
>>>>> > (closes its context).
>>>>> > after child2 terminates parent cannot receive messages. Even though
>>>>> > the parent is active, zmq_send in the server fails with no route to
>>>>> > host.
>>>>> >
>>>>> > I have no idea why this fails. Any ideas what might be causing this?
>>>>> >
>>>>> > Best,
>>>>> > Selim Ciraci
>>>>>
>>>>> _______________________________________________
>>>>> zeromq-dev mailing list
>>>>> [email protected]
>>>>> http://lists.zeromq.org/mailman/listinfo/zeromq-dev
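
For anyone who wants to check the two observations in this thread (inherited
descriptors showing up under /proc/<pid>/fd, and closing inherited sockets
directly in the child), here is a minimal sketch in plain C. It does not use
libzmq at all: a socketpair() stands in for the sockets a context would own,
and the helper names count_open_fds and fds_released_in_child are made up for
this illustration. It assumes Linux, since it reads /proc/self/fd.

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: count the descriptors open in this process by
 * listing /proc/self/fd (Linux only). Returns -1 if /proc is unavailable. */
static int count_open_fds(void)
{
    DIR *dir = opendir("/proc/self/fd");
    if (dir == NULL)
        return -1;

    int listing_fd = dirfd(dir);  /* the listing itself holds a descriptor */
    int count = 0;
    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        if (entry->d_name[0] == '.')
            continue;             /* skip "." and ".." */
        if (atoi(entry->d_name) == listing_fd)
            continue;             /* do not count our own listing */
        count++;
    }
    closedir(dir);
    return count;
}

/* Hypothetical helper: open a socket pair (a stand-in for sockets owned by
 * a messaging context), fork, and let the child close() the inherited
 * descriptors directly. Returns how many descriptors the child released,
 * or -1 on failure. */
static int fds_released_in_child(void)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: both ends were inherited and are visible in /proc. */
        int before = count_open_fds();
        close(sv[0]);
        close(sv[1]);
        int after = count_open_fds();
        _exit(before >= 0 && after >= 0 ? before - after : 255);
    }

    /* Parent: the child's close() calls do not affect these descriptors. */
    int status = 0;
    int released = -1;
    if (waitpid(pid, &status, 0) == pid && WIFEXITED(status)
        && WEXITSTATUS(status) != 255)
        released = WEXITSTATUS(status);

    close(sv[0]);
    close(sv[1]);
    return released;
}
```

Calling count_open_fds() at each fork depth would show whether descriptors
really accumulate from generation to generation, which is the suspicion raised
above about the getaddrinfo failure. Whether closing descriptors behind a live
ZeroMQ context is safe is a separate question that this sketch does not answer.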
