Re: [zeromq-dev] need advice! hitting assertion in epoll.cpp:131 (zmq 4.2.1) when going from RHEL6 to RHEL7?!
On 16.02.2017 16:26, Luca Boccassi wrote:
> What's the file limit on the 2 systems? (With the user that runs the
> program)

ulimit -n on both 6.8 and 7.3:

  development environment: ulimit -n = 1024
  installed environment:   ulimit -n = 4096

With a basic sampling of file descriptors

  while true; do
    lsof -P -M -l -n -d '^cwd,^err,^ltx,^mem,^mmap,^pd,^rtd,^txt' -p -u $USER -a | wc -l
    sleep 2
  done

the total number of file descriptors for $USER increases only by about 600.

___
zeromq-dev mailing list
zeromq-dev@lists.zeromq.org
https://lists.zeromq.org/mailman/listinfo/zeromq-dev
Re: [zeromq-dev] need advice! hitting assertion in epoll.cpp:131 (zmq 4.2.1) when going from RHEL6 to RHEL7?!
> Are you building your own binaries in both cases?

yes

> What polling mechanism was RHEL 6 using? You can see it in the
> ./configure output: "Using 'epoll' polling system"

from config.log: Using 'epoll' polling system with CLOEXEC
[zeromq-dev] need advice! hitting assertion in epoll.cpp:131 (zmq 4.2.1) when going from RHEL6 to RHEL7?!
Hello,

I could use some advice to diagnose the following issue.

I have a program that has been running without problems for a couple of
years on Red Hat Enterprise Linux 6 at various sites. On RHEL7, the
program triggers the assertion

  Bad file descriptor (src/epoll.cpp:131)

in about 1/3 of executions, during startup (sometimes during shutdown).
Less often, I see

  Bad file descriptor (src/epoll.cpp:100)

The problem persists after upgrading to ZeroMQ 4.2.1 from 4.1.6.

I don't get it! Programming errors aside, I do check all return codes
and log errors as they occur in the main thread, and there is nothing
until libzmq commits suicide from one of its threads.

Any idea/advice on how I could track down this problem? What makes
RHEL7 different enough from RHEL6 to bring out this kind of error?

Cheers :-(

GDB BACKTRACE FROM CORE FILE:

Thread 3 (Thread 0xf736b900 (LWP 5039)):
#0  0xf7751430 in __kernel_vsyscall ()
#1  0xf745694b in poll () from /lib/libc.so.6
#2  0xf6ff5457 in zmq::socket_poller_t::wait(zmq::socket_poller_t::event_t*, int, long) () from $TOP/lib/platform/libzmq.so.5
#3  0xf6ff325f in zmq_poller_wait_all(void*, zmq_poller_event_t*, int, long) () from $TOP/lib/platform/libzmq.so.5
#4  0xf6ff3aa5 in zmq_poller_poll(zmq_pollitem_t*, int, long) () from $TOP/lib/platform/libzmq.so.5
#5  0xf6ff2bb1 in zmq_poll () from $TOP/lib/platform/libzmq.so.5
#6  0xf702cec1 in zt_reactor_loop (r=) at $TOP/src/reactor.c:268
(...)
#17 0x080487da in main ()

Thread 2 (Thread 0xf6e6db40 (LWP 5066)):
#0  0xf7751430 in __kernel_vsyscall ()
#1  0xf7463a16 in epoll_wait () from /lib/libc.so.6
#2  0xf6fa17d0 in zmq::epoll_t::loop() () from $TOP/lib/platform/libzmq.so.5
#3  0xf6fa1a35 in zmq::epoll_t::worker_routine(void*) () from $TOP/lib/platform/libzmq.so.5
#4  0xf6fe36f2 in thread_routine () from $TOP/lib/platform/libzmq.so.5
#5  0xf7574b2c in start_thread () from /lib/libpthread.so.0
#6  0xf746308e in clone () from /lib/libc.so.6

Thread 1 (Thread 0xf666cb40 (LWP 5067)):
#0  0xf7751430 in __kernel_vsyscall ()
#1  0xf739a1f7 in raise () from /lib/libc.so.6
#2  0xf739ba33 in abort () from /lib/libc.so.6
#3  0xf6fa2726 in zmq::zmq_abort(char const*) () from $TOP/lib/platform/libzmq.so.5
#4  0xf6fa164b in zmq::epoll_t::set_pollout(void*) () from $TOP/lib/platform/libzmq.so.5
#5  0xf6fa3951 in zmq::io_object_t::set_pollout(void*) () from $TOP/lib/platform/libzmq.so.5
#6  0xf6fdafe1 in zmq::stream_engine_t::restart_output() () from $TOP/lib/platform/libzmq.so.5
#7  0xf6fcae20 in zmq::session_base_t::read_activated(zmq::pipe_t*) () from $TOP/lib/platform/libzmq.so.5
#8  0xf6fb9dd3 in zmq::pipe_t::process_activate_read() () from $TOP/lib/platform/libzmq.so.5
#9  0xf6fb2a9e in zmq::object_t::process_command(zmq::command_t&) () from $TOP/lib/platform/libzmq.so.5
#10 0xf6fa3f77 in zmq::io_thread_t::in_event() () from $TOP/lib/platform/libzmq.so.5
#11 0xf6fa1948 in zmq::epoll_t::loop() () from $TOP/lib/platform/libzmq.so.5
#12 0xf6fa1a35 in zmq::epoll_t::worker_routine(void*) () from $TOP/lib/platform/libzmq.so.5
#13 0xf6fe36f2 in thread_routine () from $TOP/lib/platform/libzmq.so.5
#14 0xf7574b2c in start_thread () from /lib/libpthread.so.0
#15 0xf746308e in clone () from /lib/libc.so.6
[zeromq-dev] setting the CLOEXEC flag for eventfd and epoll instances?
while investigating a problem involving fork() and zeromq, I found some
file descriptor leaks.

1. The constructor zmq::epoll_t::epoll_t (const zmq::ctx_t &ctx_) in
src/epoll.cpp creates an epoll instance with

  epoll_fd = epoll_create (1);

SUGGESTION: replace with

  epoll_fd = epoll_create1 (EPOLL_CLOEXEC);

2. The function zmq::signaler_t::make_fdpair (fd_t *r_, fd_t *w_) in
src/signaler.cpp creates an eventfd object with

  #if defined ZMQ_HAVE_EVENTFD
      fd_t fd = eventfd (0, 0);

SUGGESTION: replace with

  #if defined ZMQ_HAVE_EVENTFD
      fd_t fd = eventfd (0, EFD_CLOEXEC);
Re: [zeromq-dev] What is the canonical handling of zeromq sockets when fork+exec?
On 25.11.2016 11:50, Luca Boccassi wrote:
> What I can say is that we have a unit test for this situation:
> https://github.com/zeromq/libzmq/blob/master/tests/test_fork.cpp
> And the child closes the (TCP) socket explicitly before the context.
> Which is in fact what should happen in all cases. The parent then can
> receive messages on the sockets just fine.
> Maybe it's a linger issue? By default a socket has 30s of linger grace
> period. Try setting ZMQ_LINGER to 0 in the socket in the child, close
> the socket and then terminate the context perhaps.

thanks. Formatted differently:

1. zmq_close sockets in child (perhaps setting ZMQ_LINGER to 0 beforehand)
2. zmq_term context in child

and only then

3. close rest of file descriptors in child

The reason I went directly to point 3 is this line from the man page of
fork(2):

> The child process is created with a single thread—the one that called
> fork().

(see http://man7.org/linux/man-pages/man2/fork.2.html)

Michael Kerrisk in "The Linux Programming Interface" insists:

> When a multithreaded process calls fork(), only the calling thread is
> replicated in the child process. (The ID of the thread in the child is
> the same as the ID of the thread that called fork() in the parent.)
> All of the other threads vanish in the child; no thread-specific data
> destructors or cleanup handlers are executed for those threads. (...)

Of course, that's where I run into the problem?!
[zeromq-dev] What is the canonical handling of zeromq sockets when fork+exec?
* Background

I have a service that starts workers on demand with fork+exec. The
requests arrive over zeromq sockets. After the fork, before the exec, I
close all file descriptors > 2, keeping only stdin/out/err. I then exec
the requested program.

* Problem

It works. Except that I get some rare core dumps (of the service) with
the following assertion failure:

  Bad file descriptor (src/epoll.cpp:90)

and the backtrace:

#0  0xf77f5430 in __kernel_vsyscall ()
#1  0xf743f1f7 in raise () from /lib/libc.so.6
#2  0xf7440a33 in abort () from /lib/libc.so.6
#3  0xf7067134 in zmq::zmq_abort(char const*) () from $LIBS/libzmq.so.5
#4  0xf7065e6c in zmq::epoll_t::rm_fd(void*) () from $LIBS/libzmq.so.5
#5  0xf7068823 in zmq::io_object_t::rm_fd(void*) () from $LIBS/libzmq.so.5
#6  0xf70958af in zmq::stream_engine_t::unplug() () from $LIBS/libzmq.so.5
#7  0xf7098711 in zmq::stream_engine_t::error(zmq::stream_engine_t::error_reason_t) () from $LIBS/libzmq.so.5
#8  0xf7098867 in zmq::stream_engine_t::timer_event(int) () from $LIBS/libzmq.so.5
#9  0xf707f972 in zmq::poller_base_t::execute_timers() () from $LIBS/libzmq.so.5
#10 0xf7066209 in zmq::epoll_t::loop() () from $LIBS/libzmq.so.5
#11 0xf7066467 in zmq::epoll_t::worker_routine(void*) () from $LIBS/libzmq.so.5
#12 0xf709d67e in thread_routine () from $LIBS/libzmq.so.5
#13 0xf7619b2c in start_thread () from /lib/libpthread.so.0
#14 0xf750808e in clone () from /lib/libc.so.6

This is with zeromq-4.1.4 on RHEL 7.3 x86_64.

So I wonder: is there some interaction between parent and child?

* Documentation

The Guide and the FAQ do not address explicitly the fork+exec point.
The question has been asked several times on the mailing list in
various forms, without a definitive answer (for dummies like me at
least).

* Questions

Do I need to zmq_close the sockets in the child? Or is zmq_term in the
child enough? Does closing the file descriptors in the child cause
problems in the parent? What is the correct way to handle this?
Re: [zeromq-dev] Assertion failed: !more (src/fq.cpp:117) = fix works!
Hi Doron,

I tried out your latest repo

  https://github.com/somdoron/libzmq/commit/3775d0853a8c1f1c3854a94c7fe12e78046faeca

with the changes to src/socket_base.cpp, src/pipe.cpp and src/pipe.hpp.

I confirm that the problem reported at
https://github.com/zeromq/libzmq/issues/2163 is solved: the assertion
in src/fq.cpp:118 is not triggered anymore. Thanks!

There is one point I'm unsure about. The documentation of
zmq_disconnect states

> Any outstanding messages physically received from the network but not
> yet received by the application with _zmq_recv()_ shall be discarded.

(https://github.com/zeromq/libzmq/blob/master/doc/zmq_disconnect.txt#L19-L20)

In the PUB->SUB test case, the message is not discarded by the
zmq_disconnect, and is received (sans abort) with zmq_msg_recv.
Doesn't this behavior contradict the doc?
Re: [zeromq-dev] about Assertion failed: !more (src/fq.cpp:117): just remove the assertion?
On 20.10.2016 15:46, Doron Somech wrote:
> Actually I think those are different issues. If you suffer from the
> pubsub issue, I think I traced the bug and have a solution. Take a
> look at the issue. If you suffer from the 100K issue, I think that is
> a different one, anyway you can try the solution as well. It might be
> related to another pipe.terminate call

I just tweaked the test case of issue 2163 for a message with a single
frame. No assertion, the program completes. There seems to be something
broken in the combination of PUB -> SUB, multi-part messages, and
zmq_disconnect.
Re: [zeromq-dev] about Assertion failed: !more (src/fq.cpp:117): just remove the assertion?
On 20.10.2016 14:31, Doron Somech wrote:
> Also I think it smells like using socket from multiple threads...

unfortunately no, the assertion strikes in a single-threaded
application. See also the test case at
https://github.com/zeromq/libzmq/issues/2163

On Thu, Oct 20, 2016 at 3:28 PM, Doron Somech wrote:
> I don't think it is correct, there might be an issue we need to solve.
> The workaround is not to use multi-part messages...

I like multi-part messages! The workaround would require me to do the
framing myself, which would be tricky, considering the advanced state
of the application. Thanks anyway.
[zeromq-dev] about Assertion failed: !more (src/fq.cpp:117): just remove the assertion?
Hi,

Ranjeet Kumar described his troubles with the assertion in
http://lists.zeromq.org/pipermail/zeromq-dev/2016-September/030839.html

According to
http://lists.zeromq.org/pipermail/zeromq-dev/2016-September/030851.html
he apparently solved his problem by commenting the assertion out.

1. Could someone who understands that part of the code advise on the
   safety of removing that assertion?
2. Is the assertion even correct?

Thanks
Re: [zeromq-dev] BUG: SUB socket + multi-part message + disconnect + recv = Assertion failed at src/fq.cpp:117
On 07.10.2016 10:14, Laughing wrote:
> Does that mean that socket.disconnect does not disconnect from all
> endpoints connected before?

see http://api.zeromq.org/4-1:zmq-disconnect

  int zmq_disconnect (void *socket, const char *endpoint);

zmq_disconnect disconnects the socket from the given endpoint *only*.

> That is such bad news. I would like to use the disconnect routine to
> discard messages in 'REQ/REP' and 'PUSH/PULL' mode.

A ZeroMQ socket holds received messages in a queue. To empty the queue,
you can

a. receive and discard the messages until zmq_poll tells you there is
   no data
b. destroy, then recreate & reconnect the socket

I do not know the internal details, for example whether a socket has a
queue per connection. The bug seems to point towards some data
management problem between queues in the socket.
Re: [zeromq-dev] BUG: SUB socket + multi-part message + disconnect + recv = Assertion failed at src/fq.cpp:117
On 07.10.2016 08:04, Laughing wrote:
> I think that the socket cannot recv message any more after disconnect.

not quite correct: the SUB socket could still be connected to several
other PUB sockets. The abort is still present when you modify the test
case accordingly.

> The first frame is also an unexpected message.

it is not unexpected, as zmq_poll indicates that the SUB socket has
data.