Hi,

The only solution we could find to the leaking-sockets problem is to destroy the parent context before fork and then re-initialize it in the parent after fork. Sometimes this re-initialization fails in the parent: somehow the router-dealer connections are not established. We are looking at this problem now.

Best,
Selim

On Wed, Sep 18, 2013 at 11:03 PM, Selim Ciraci wrote:

> Hi,
>
> Here is some more info on the error:
>
> After forking a child-child-child...child process (whose parents are
> terminated cleanly using zmq_term), zmq_connect fails. For instance:
>   pid 1 forks pid 2.
>   pid 2 connects to the server and does some work.
>   pid 2 asks pid 1 to terminate; pid 1 terminates (zmq_term() is called).
>   pid 2 forks pid 3.
>   pid 3 connects to the server and does some work.
>   pid 3 asks pid 2 to terminate; pid 2 terminates.
>   ....
>   pid 10 forks pid 11.
>   pid 11 tries to connect to the server; zmq_connect fails with EINVAL.
> Further trace on the error shows that the call to getaddrinfo from
> tcp_address::resolve_hostname() fails.
>
> Our code passes tcp://localhost:5555 as the address to connect to (the
> value does not change, it is a constant string). The connection works in
> all child processes until we reach a certain depth. At that point
> getaddrinfo on localhost fails with "no address associated with that
> name". This is kind of weird. I don't know what might cause this. In
> fact, I verified the parameters passed to getaddrinfo and all seems OK.
>
> On a side note, I think the sockets inherited from the parents are not
> closed. I can see the sockets in /proc/<pid>/fd (or fds, I don't
> remember). Moreover, I see that the server (with the router socket)
> removes the pipes associated with dead parent ids when the
> child-child-child..-child process terminates successfully (i.e., when it
> calls zmq_term). For the error in getaddrinfo, I think the system is
> running out of fds, so an fd operation is failing. I might be wrong
> though. Any comments?
>
> Any help is greatly appreciated! The code I'm using is around 250000
> lines of code, so it is a bit hard to get a test case. But I'm working
> on it.
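
One way to check the fd-exhaustion guess above might be to log the number of open descriptors right before zmq_connect and to call getaddrinfo directly with the same host and port. The following is only a minimal sketch, assuming Linux (it reads /proc/self/fd). count_open_fds and resolve_status are hypothetical helper names, not part of libzmq, and the AF_INET/SOCK_STREAM hints are an assumption that may differ from what libzmq passes:

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: number of file descriptors currently open in this
 * process, or -1 if /proc/self/fd cannot be read (Linux only). */
int count_open_fds(void)
{
    DIR *dir = opendir("/proc/self/fd");
    if (dir == NULL)
        return -1;

    int count = 0;
    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        if (entry->d_name[0] != '.')   /* skip "." and ".." */
            count++;
    }
    closedir(dir);

    /* The directory stream itself held one descriptor while we counted. */
    return count - 1;
}

/* Hypothetical helper: resolve host/port with assumed TCP/IPv4 hints and
 * return getaddrinfo's status code (0 on success). On failure, log the
 * error text together with the current descriptor count. */
int resolve_status(const char *host, const char *port)
{
    struct addrinfo hints;
    struct addrinfo *result = NULL;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_INET;
    hints.ai_socktype = SOCK_STREAM;

    int rc = getaddrinfo(host, port, &hints, &result);
    if (rc != 0) {
        fprintf(stderr, "getaddrinfo(%s, %s) failed: %s (open fds: %d)\n",
                host, port, gai_strerror(rc), count_open_fds());
        return rc;
    }
    freeaddrinfo(result);
    return 0;
}
```

Calling resolve_status("localhost", "5555") at each fork depth, next to a count_open_fds() log line, would show whether the descriptor count keeps growing along the chain and whether the lookup starts failing at the same depth as zmq_connect.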
>
> Best,
> Selim Ciraci
>
>
> On Mon, Sep 16, 2013 at 4:15 PM, Matt Connolly wrote:
>
>> There are two types of sockets used by zeromq as far as I understand:
>> external connections and internal pipes used to communicate between the
>> io threads and the host application.
>>
>> My patch for zmq_term replaces all of the internal pipes with new ones.
>> This allows the termination process to complete without affecting the
>> pipes that were inherited from the parent process, which caused asserts
>> in the parent.
>>
>> Returning EINTR was intended so that terminating the context would
>> behave the same as if the process received a signal. (It could be
>> receiving signals for other reasons, e.g. a usr signal.)
>>
>> If there are connected zmq sockets (to some other machine, for example),
>> then those sockets would also be inherited, but I thought they would
>> have been closed correctly by the termination process. This may not be
>> working right, and activity on these sockets between fork and terminate
>> in the child may interfere with the parent context's ability to use
>> these sockets. Perhaps these sockets are not actually being closed
>> properly and are causing this problem.
>>
>> I'll take a closer look later in the week and see...
>>
>>
>> Regards,
>> Matt.
>>
>> On 17 Sep 2013, at 8:22 am, Selim Ciraci wrote:
>>
>> Hi Matt,
>>
>> Another thing is, sorry if I'm wrong, but zmq_term in the child always
>> returns EINTR. This is because most of the socket operations return
>> EINTR when pid != getpid(). With your patch the signaler will create a
>> new eventfd (correct me if I'm wrong) and then return. It is up to the
>> reaper thread to close the sockets, right? But since most operations
>> just return EINTR, I wonder if the sockets are really closed after the
>> fork.
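
The pid != getpid() check mentioned above can be pictured with a small stand-alone sketch. This is not libzmq's code: fork_guard_t, fork_guard_init, fork_guard_check and child_sees_eintr are hypothetical names, used only to illustrate how an object could remember the process that created it and refuse operations with EINTR once it is used from a forked child:

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical illustration type: remembers which process created it. */
typedef struct {
    pid_t owner;
} fork_guard_t;

void fork_guard_init(fork_guard_t *guard)
{
    guard->owner = getpid();
}

/* Returns 0 in the creating process. In any other process (for example a
 * forked child) returns -1 with errno set to EINTR. */
int fork_guard_check(const fork_guard_t *guard)
{
    if (guard->owner != getpid()) {
        errno = EINTR;
        return -1;
    }
    return 0;
}

/* Forks once and reports what the child observed:
 * 1 if the child's check failed with EINTR, 0 if it passed,
 * -1 if fork or waitpid failed. */
int child_sees_eintr(const fork_guard_t *guard)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        int rejected = fork_guard_check(guard) == -1 && errno == EINTR;
        _exit(rejected ? 1 : 0);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

With a guard like this, the parent keeps getting 0 from fork_guard_check while every call in the child fails with EINTR. If the close path in the child goes through the same kind of check, that would be consistent with the concern above that inherited sockets may never actually be closed there.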
>>
>> Best,
>> Selim Ciraci
>>
>>
>> On Mon, Sep 16, 2013 at 11:40 AM, Selim Ciraci wrote:
>>
>>> Hi Matt,
>>>
>>> It is not an assertion failure. The problem occurs in connections
>>> between router-dealer sockets. The send function in router.cpp returns
>>> "no route to host" because it cannot find the host_id in the
>>> outpipes_t. Careful debugging shows that the pipe from the dealer to
>>> the router has actually not been established. I put a printf in the
>>> xidentify_peer method in router.cpp; as far as I know, the new client
>>> ids are inserted into the outpipes_t in this method. The aim here is to
>>> compare the child process ids with the ids the router socket received.
>>> The comparison showed that some child ids went missing (the router
>>> socket never received them). I must add that the ids went missing after
>>> a parent process terminated, though I need further testing to prove
>>> this.
>>>
>>> Any ideas what might be going wrong here? I'm going to try to implement
>>> a simple test case.
>>>
>>> Thanks,
>>> Selim
>>>
>>>
>>> On Mon, Sep 16, 2013 at 6:13 AM, Matt Connolly wrote:
>>>
>>>> Hi Selim,
>>>>
>>>> I don’t have any ideas yet about why the parent would stop sending
>>>> messages after forking a second child.
>>>>
>>>> Is it possible to reproduce this in a simple test case?
>>>>
>>>> And when the no route to host error occurs, is that an assertion? If
>>>> so, can you provide a stack trace?
>>>>
>>>> -Matt
>>>>
>>>> On 14 Sep 2013, at 6:43 am, Selim Ciraci wrote:
>>>>
>>>> > Hi Matt,
>>>> >
>>>> > Thanks for your reply. I actually found out about your patch after
>>>> > the email. I have updated zmq to head from github and tried it with
>>>> > my program. The parent sockets seem to have closed. But the problem
>>>> > is that every now and then I get "no route to host" errors in
>>>> > zmq_send.
>>>> > This usually happens when:
>>>> >   parent forks a child; child calls zmq_term(parent_context), does
>>>> >   work and then terminates (closes its context).
>>>> >   parent in parallel uses parent_context, does work, learns the
>>>> >   child has terminated, forks a new child, child2.
>>>> >   child2 calls zmq_term(parent_context), does work and then
>>>> >   terminates (closes its context).
>>>> >   after child2 terminates, parent cannot receive messages. Even
>>>> >   though the parent is active, zmq_send in the server fails with no
>>>> >   route to host.
>>>> >
>>>> > I have no idea why this fails. Any ideas what might be causing this?
>>>> >
>>>> > Best,
>>>> > Selim Ciraci
>>>>
>>>> _______________________________________________
>>>> zeromq-dev mailing list
>>>> http://lists.zeromq.org/mailman/listinfo/zeromq-dev
