Hi,

Here is some more info on the error:

After forking a child-child-child...child process (whose parents have been terminated cleanly using zmq_term), zmq_connect fails. For instance:

pid1 forks pid2.
pid2 connects to the server and does some work.
pid2 asks pid1 to terminate; pid1 terminates (zmq_term() is called).
pid2 forks pid3.
pid3 connects to the server and does some work.
pid3 asks pid2 to terminate; pid2 terminates.
....
pid10 forks pid11.
pid11 tries to connect to the server; zmq_connect fails with EINVAL.

Tracing the error further shows that the call to getaddrinfo from tcp_address::resolve_hostname() fails. Our code passes tcp://localhost:5555 as the address to connect to (the value does not change; it is a constant string). The connection works in all child processes until we reach a certain depth. At that point getaddrinfo on localhost fails with "no address associated with that name". This is kind of weird, and I don't know what might cause it. I verified the parameters passed to getaddrinfo and they all seem OK.

On a side note, I think the sockets inherited from the parents are not closed. I can see the sockets in /proc/<pid>/fd (or fds, I don't remember). Moreover, I see that the server (with the router socket) removes the pipes associated with dead parent ids when the child-child-child..-child process terminates successfully (i.e., when it calls zmq_term).

As for the error in getaddrinfo, I think the system is running out of fds, so an fd operation is failing. I might be wrong though.

Any comments? Any help is greatly appreciated! The code I'm using is around 250,000 lines, so it is a bit hard to extract a test case, but I'm working on it.

Best,
Selim Ciraci

On Mon, Sep 16, 2013 at 4:15 PM, Matt Connolly <[email protected]> wrote:

> There are two types of sockets used by zeromq as far as I understand:
> external connections and internal pipes used to communicate between the io
> threads and the host application.
>
> My patch for zmq_term closes all of the internal pipes and replaces them with new ones.
> This allows the termination process to complete without affecting the pipes that
> were inherited from the parent process, which caused asserts in the parent.
>
> Returning EINTR was intended so that terminating the context would behave
> the same as if the process received a signal. (It could be receiving
> signals for other reasons, e.g. a usr signal.)
>
> If there are connected zmq sockets (to some other machine, for example)
> then those sockets would also be inherited, but I thought they would have
> been closed correctly by the termination process. This may not be working
> right, and activity on these sockets between fork and terminate in the child
> may interfere with the parent context's ability to use these sockets.
> Perhaps these sockets are not actually being closed properly and are causing
> this problem.
>
> I'll take a closer look later in the week and see...
>
>
> Regards,
> Matt.
>
> On 17 Sep 2013, at 8:22 am, Selim Ciraci <[email protected]> wrote:
>
> Hi Matt,
>
> Another thing: sorry if I'm wrong, but zmq_term in the child always
> returns EINTR. This is because most of the socket operations return EINTR
> when pid != getpid(). With your patch the signaler will create a new eventfd
> (correct me if I'm wrong) and then return. It is up to the reaper thread to
> close the sockets, right? But since most operations just return EINTR, I
> wonder if the sockets are really closed after the fork.
>
> Best,
> Selim Ciraci
>
>
> On Mon, Sep 16, 2013 at 11:40 AM, Selim Ciraci <[email protected]> wrote:
>
>> Hi Matt,
>>
>> It is not an assertion failure. The problem occurs in connections between
>> router-dealer sockets. The send function in router.cpp returns no route to
>> host because it cannot find the host_id in the outpipes_t. Careful debugging
>> shows that the pipe from the dealer to the router has actually not been
>> established.
>> I put a printf in the xidentify_peer method in router.cpp; as far as I know,
>> the new client ids are inserted into the outpipes_t in this method.
>> The aim here is to compare the child process ids with the ids the router
>> socket received. The comparison showed that some child ids went
>> missing (the router socket never received them). I must add that the ids went
>> missing after a parent process terminated, though I need further testing to
>> prove this.
>>
>> Any ideas what might be going wrong here? I'm going to try to implement a
>> simple test case.
>>
>> Thanks,
>> Selim
>>
>>
>> On Mon, Sep 16, 2013 at 6:13 AM, Matt Connolly <[email protected]> wrote:
>>
>>> Hi Selim,
>>>
>>> I don’t have any ideas yet about why the parent would stop sending
>>> messages after forking a second child.
>>>
>>> Is it possible to reproduce this in a simple test case?
>>>
>>> And when the no route to host error occurs, is that an assertion? If so,
>>> can you provide a stack trace?
>>>
>>> -Matt
>>>
>>> On 14 Sep 2013, at 6:43 am, Selim Ciraci <[email protected]> wrote:
>>>
>>> > Hi Matt,
>>> >
>>> > Thanks for your reply. I actually found out about your patch
>>> > after sending the email. I have updated zmq to head from github and tried it
>>> > with my program. The parent sockets seem to have been closed. But the problem
>>> > is that every now and then I get "no route to host" errors in zmq_send. This
>>> > usually happens when:
>>> > parent forks a child; the child calls zmq_term(parent_context), does work,
>>> > and then terminates (closes its context).
>>> > parent in parallel uses parent_context, does work, learns the child
>>> > has terminated, and forks a new child, child2.
>>> > child2 calls zmq_term(parent_context), does work, and then terminates
>>> > (closes its context).
>>> > after child2 terminates, the parent cannot receive messages. Even though
>>> > the parent is active, zmq_send in the server fails with no route to host.
>>> >
>>> > I have no idea why this fails. Any ideas what might be causing this?
>>> >
>>> > Best,
>>> > Selim Ciraci
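
One way to test the guess that getaddrinfo fails because the process is running out of fds is to log the descriptor count in every generation, right after the fork and again just before zmq_connect. Below is a minimal sketch in Python, assuming Linux with /proc/self/fd available. The names count_open_fds and fd_report are hypothetical helpers for illustration and are not part of libzmq.

```python
import os
import resource


def count_open_fds():
    """Return the number of file descriptors open in this process."""
    try:
        # Linux: /proc/self/fd has one entry per open descriptor. Listing the
        # directory typically opens one descriptor of its own, so subtract it.
        return len(os.listdir("/proc/self/fd")) - 1
    except OSError:
        pass
    # Fallback when /proc is unavailable: probe descriptors one by one.
    soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
    limit = soft if 0 < soft <= 65536 else 65536
    count = 0
    for fd in range(limit):
        try:
            os.fstat(fd)
        except OSError:
            continue
        count += 1
    return count


def fd_report(label):
    """Format the current descriptor count next to the soft RLIMIT_NOFILE."""
    soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
    return "%s pid=%d open_fds=%d soft_limit=%d" % (
        label, os.getpid(), count_open_fds(), soft)


if __name__ == "__main__":
    # Hypothetical call site: run this in each generation after fork and
    # again just before connecting.
    print(fd_report("before connect"))
```

If the count grows with every generation and approaches the soft limit near the depth where the connect starts failing, that would support the fd-exhaustion explanation; a flat count would point elsewhere.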
_______________________________________________
zeromq-dev mailing list
[email protected]
http://lists.zeromq.org/mailman/listinfo/zeromq-dev
