Re: [zeromq-dev] need advice! hitting assertion in epoll.cpp:131 (zmq 4.2.1) when going from RHEL6 to RHEL7?!
On 16.02.2017 16:26, Luca Boccassi wrote:
> What's the file limit on the 2 systems? (With the user that runs the program)
> ulimit -n

On both 6.8 and 7.3:

  development environment: ulimit -n = 1024
  installed environment:   ulimit -n = 4096

With a basic sampling of file descriptors

  while true; do
    lsof -P -M -l -n -d '^cwd,^err,^ltx,^mem,^mmap,^pd,^rtd,^txt' -p -u $USER -a | wc -l
    sleep 2
  done

the total number of file descriptors for $USER increases only by about 600.

___
zeromq-dev mailing list
zeromq-dev@lists.zeromq.org
https://lists.zeromq.org/mailman/listinfo/zeromq-dev
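The figures above come from the shell (ulimit -n and lsof). Note that ulimit -n reports the soft limit of the shell it runs in; a process started by a different launcher may see another value. As a cross-check from inside a process, here is a minimal sketch in Python, assuming Linux with /proc mounted. The function names are hypothetical and not part of the program discussed in this thread.

```python
import os
import resource


def fd_limits():
    """Return the (soft, hard) RLIMIT_NOFILE values for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def count_open_fds(proc_dir="/proc/self/fd"):
    """Count this process's open descriptors via /proc (Linux only).

    Returns None where the directory is unavailable. The count may include
    the descriptor that listdir() opens briefly to read the directory.
    """
    try:
        return len(os.listdir(proc_dir))
    except OSError:
        return None


if __name__ == "__main__":
    soft, hard = fd_limits()
    print("RLIMIT_NOFILE soft=%d hard=%d open=%s" % (soft, hard, count_open_fds()))
```

Logging these two values at startup on both RHEL releases would show whether the running process really sees the same limit as the interactive shell.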
Re: [zeromq-dev] router socket hangs on write (was detecting dead MDP workers)
Hi,

I dug a bit deeper, here are my findings:
- removing the on/off switching for the ZMQ_ROUTER_MANDATORY flag, and enabling it before the router socket bind: makes no difference
- removing the monitor trigger and heartbeating the workers periodically (2.5 sec) drastically reduces the occurrence rate: the program hangs after 3-4 hours instead of seconds. (In the background a worker connects/disconnects with a 4 second period.)

From this I suspect the issue appears in a small timeframe close to the monitor event, but is otherwise hard to hit.

With GDB I see the following:
- in zmq::socket_base_t::send() the call to xsend() returns EAGAIN. This should not happen since ZMQ_DONTWAIT is not specified.
- ZMQ_DONTWAIT is not specified, so the function won't return -1, but block (see trace in prev mail).
- inside zmq::router_t::xsend() the pipe is found in the outpipes map, but the check_write() on it returns false
- the if(mandatory) check in this block (router.cpp:218) returns with -1, EAGAIN
- a similar block 10 lines below returns with -1, EHOSTUNREACH

Should both if(mandatory) checks return EHOSTUNREACH? There's also a comment in the header for bool mandatory saying that it will report EAGAIN, but this contradicts the documentation. Can you help to clarify?

Regards,
Gyorgy

On Thu, Feb 16, 2017 at 12:22 PM, Gyorgy Szekely wrote:
> Hi,
> Continuing my journey on detecting dead workers I reduced the design to
> the minimum, and eliminated the messy file descriptors.
> I only have:
> - a router socket, with some number of peers
> - a monitor socket attached to the router socket
>
> When the monitor detects a disconnect on the router socket:
> - do setsockopt(ZMQ_ROUTER_MANDATORY, 1);
> - send heartbeat message to every known peer
> - if EHOSTUNREACH returned: remove the peer
> - do setsockopt(ZMQ_ROUTER_MANDATORY, 0);
>
> What happens: _my app regularly hangs_ in zmq_msg_send(). Roughly 20% of
> the invocations. The call never returns, I have to kill the application.
>
> What am I doing wrong??? According to the RFCs router sockets should
> never block.
> I attached a full stacktrace with info locals and args for each relevant
> frame (sorry for the machine readable format).
>
> Env:
> libzmq 4.2.1 stable, debug build
> Ubuntu 16.04 64bit (the same happens with ubuntu packaged lib)
>
> Regards,
> Gyorgy
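The message above mentions heartbeating the workers every 2.5 seconds as an alternative to reacting to monitor events. One common way to turn periodic heartbeats into dead-peer detection is to record when each peer was last heard from and expire those that stay silent too long, which avoids depending on the error returned by a send. The following is a minimal sketch of that bookkeeping in Python. It is not the poster's code: the class name, the max_missed value of 3 and the injectable clock are all assumptions made for illustration, and no ZeroMQ calls are involved.

```python
import time


class PeerLiveness:
    """Track when each peer was last heard from and expire silent ones."""

    def __init__(self, interval=2.5, max_missed=3, clock=time.monotonic):
        # A peer is considered dead after max_missed heartbeat intervals
        # without any traffic (max_missed=3 is an assumed value).
        self.timeout = interval * max_missed
        self.clock = clock
        self.last_seen = {}

    def heard_from(self, peer_id):
        """Call for every message (heartbeat or not) received from a peer."""
        self.last_seen[peer_id] = self.clock()

    def expire(self):
        """Remove and return the peers that have been silent too long."""
        now = self.clock()
        dead = [p for p, t in self.last_seen.items() if now - t > self.timeout]
        for p in dead:
            del self.last_seen[p]
        return dead
```

In a broker loop, heard_from() would be called with the routing identity of each incoming message and expire() once per heartbeat tick; the returned identities are the peers to drop.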
Re: [zeromq-dev] need advice! hitting assertion in epoll.cpp:131 (zmq 4.2.1) when going from RHEL6 to RHEL7?!
On Thu, 2017-02-16 at 15:54 +0100, zmqdev wrote:
> > Are you building your own binaries in both cases?
>
> yes
>
> > What polling mechanism was RHEL 6 using? You can see it in
> > the ./configure output: "Using 'epoll' polling system"
>
> from config.log:
>
> Using 'epoll' polling system with CLOEXEC

What's the file limit on the 2 systems? (With the user that runs the program)

ulimit -n

Kind regards,
Luca Boccassi
Re: [zeromq-dev] need advice! hitting assertion in epoll.cpp:131 (zmq 4.2.1) when going from RHEL6 to RHEL7?!
> Are you building your own binaries in both cases?

yes

> What polling mechanism was RHEL 6 using? You can see it in
> the ./configure output: "Using 'epoll' polling system"

from config.log:

  Using 'epoll' polling system with CLOEXEC
Re: [zeromq-dev] need advice! hitting assertion in epoll.cpp:131 (zmq 4.2.1) when going from RHEL6 to RHEL7?!
On Thu, 2017-02-16 at 14:59 +0100, zmqdev wrote:
> Hello,
>
> I could use some advice to diagnose the following issue.
>
> I have a program that has been running without problems for a couple of
> years on Red Hat Enterprise Linux 6 at various sites.
>
> On RHEL7, the program triggers the assertion
>
> Bad file descriptor (src/epoll.cpp:131)
>
> in about 1/3 of executions, during startup (sometimes during shutdown).
>
> Less often, I see
>
> Bad file descriptor (src/epoll.cpp:100)
>
> The problem persists after upgrading to ZeroMQ 4.2.1 from 4.1.6.
>
> I don't get it!
>
> Programming errors aside, I do check all return codes and log errors as
> they occur in the main thread, and there is nothing until libzmq commits
> suicide from one of its threads.
>
> Any idea/advice on how I could track down this problem?
>
> What makes RHEL7 different enough from RHEL6 to produce this kind of error?
>
> Cheers :-(
>
>
> GDB BACKTRACE FROM CORE FILE:
>
> Thread 3 (Thread 0xf736b900 (LWP 5039)):
> #0 0xf7751430 in __kernel_vsyscall ()
> #1 0xf745694b in poll () from /lib/libc.so.6
> #2 0xf6ff5457 in zmq::socket_poller_t::wait(zmq::socket_poller_t::event_t*, int, long) () from $TOP/lib/platform/libzmq.so.5
> #3 0xf6ff325f in zmq_poller_wait_all(void*, zmq_poller_event_t*, int, long) () from $TOP/lib/platform/libzmq.so.5
> #4 0xf6ff3aa5 in zmq_poller_poll(zmq_pollitem_t*, int, long) () from $TOP/lib/platform/libzmq.so.5
> #5 0xf6ff2bb1 in zmq_poll () from $TOP/lib/platform/libzmq.so.5
> #6 0xf702cec1 in zt_reactor_loop (r=) at $TOP/src/reactor.c:268
> (...)
> #17 0x080487da in main ()
>
> Thread 2 (Thread 0xf6e6db40 (LWP 5066)):
> #0 0xf7751430 in __kernel_vsyscall ()
> #1 0xf7463a16 in epoll_wait () from /lib/libc.so.6
> #2 0xf6fa17d0 in zmq::epoll_t::loop() () from $TOP/lib/platform/libzmq.so.5
> #3 0xf6fa1a35 in zmq::epoll_t::worker_routine(void*) () from $TOP/lib/platform/libzmq.so.5
> #4 0xf6fe36f2 in thread_routine () from $TOP/lib/platform/libzmq.so.5
> #5 0xf7574b2c in start_thread () from /lib/libpthread.so.0
> #6 0xf746308e in clone () from /lib/libc.so.6
>
> Thread 1 (Thread 0xf666cb40 (LWP 5067)):
> #0 0xf7751430 in __kernel_vsyscall ()
> #1 0xf739a1f7 in raise () from /lib/libc.so.6
> #2 0xf739ba33 in abort () from /lib/libc.so.6
> #3 0xf6fa2726 in zmq::zmq_abort(char const*) () from $TOP/lib/platform/libzmq.so.5
> #4 0xf6fa164b in zmq::epoll_t::set_pollout(void*) () from $TOP/lib/platform/libzmq.so.5
> #5 0xf6fa3951 in zmq::io_object_t::set_pollout(void*) () from $TOP/lib/platform/libzmq.so.5
> #6 0xf6fdafe1 in zmq::stream_engine_t::restart_output() () from $TOP/lib/platform/libzmq.so.5
> #7 0xf6fcae20 in zmq::session_base_t::read_activated(zmq::pipe_t*) () from $TOP/lib/platform/libzmq.so.5
> #8 0xf6fb9dd3 in zmq::pipe_t::process_activate_read() () from $TOP/lib/platform/libzmq.so.5
> #9 0xf6fb2a9e in zmq::object_t::process_command(zmq::command_t&) () from $TOP/lib/platform/libzmq.so.5
> #10 0xf6fa3f77 in zmq::io_thread_t::in_event() () from $TOP/lib/platform/libzmq.so.5
> #11 0xf6fa1948 in zmq::epoll_t::loop() () from $TOP/lib/platform/libzmq.so.5
> #12 0xf6fa1a35 in zmq::epoll_t::worker_routine(void*) () from $TOP/lib/platform/libzmq.so.5
> #13 0xf6fe36f2 in thread_routine () from $TOP/lib/platform/libzmq.so.5
> #14 0xf7574b2c in start_thread () from /lib/libpthread.so.0
> #15 0xf746308e in clone () from /lib/libc.so.6

Are you building your own binaries in both cases?

What polling mechanism was RHEL 6 using? You can see it in the ./configure output: "Using 'epoll' polling system"

Kind regards,
Luca Boccassi
[zeromq-dev] need advice! hitting assertion in epoll.cpp:131 (zmq 4.2.1) when going from RHEL6 to RHEL7?!
Hello,

I could use some advice to diagnose the following issue.

I have a program that has been running without problems for a couple of years on Red Hat Enterprise Linux 6 at various sites.

On RHEL7, the program triggers the assertion

  Bad file descriptor (src/epoll.cpp:131)

in about 1/3 of executions, during startup (sometimes during shutdown).

Less often, I see

  Bad file descriptor (src/epoll.cpp:100)

The problem persists after upgrading to ZeroMQ 4.2.1 from 4.1.6.

I don't get it!

Programming errors aside, I do check all return codes and log errors as they occur in the main thread, and there is nothing until libzmq commits suicide from one of its threads.

Any idea/advice on how I could track down this problem?

What makes RHEL7 different enough from RHEL6 to produce this kind of error?

Cheers :-(


GDB BACKTRACE FROM CORE FILE:

Thread 3 (Thread 0xf736b900 (LWP 5039)):
#0 0xf7751430 in __kernel_vsyscall ()
#1 0xf745694b in poll () from /lib/libc.so.6
#2 0xf6ff5457 in zmq::socket_poller_t::wait(zmq::socket_poller_t::event_t*, int, long) () from $TOP/lib/platform/libzmq.so.5
#3 0xf6ff325f in zmq_poller_wait_all(void*, zmq_poller_event_t*, int, long) () from $TOP/lib/platform/libzmq.so.5
#4 0xf6ff3aa5 in zmq_poller_poll(zmq_pollitem_t*, int, long) () from $TOP/lib/platform/libzmq.so.5
#5 0xf6ff2bb1 in zmq_poll () from $TOP/lib/platform/libzmq.so.5
#6 0xf702cec1 in zt_reactor_loop (r=) at $TOP/src/reactor.c:268
(...)
#17 0x080487da in main ()

Thread 2 (Thread 0xf6e6db40 (LWP 5066)):
#0 0xf7751430 in __kernel_vsyscall ()
#1 0xf7463a16 in epoll_wait () from /lib/libc.so.6
#2 0xf6fa17d0 in zmq::epoll_t::loop() () from $TOP/lib/platform/libzmq.so.5
#3 0xf6fa1a35 in zmq::epoll_t::worker_routine(void*) () from $TOP/lib/platform/libzmq.so.5
#4 0xf6fe36f2 in thread_routine () from $TOP/lib/platform/libzmq.so.5
#5 0xf7574b2c in start_thread () from /lib/libpthread.so.0
#6 0xf746308e in clone () from /lib/libc.so.6

Thread 1 (Thread 0xf666cb40 (LWP 5067)):
#0 0xf7751430 in __kernel_vsyscall ()
#1 0xf739a1f7 in raise () from /lib/libc.so.6
#2 0xf739ba33 in abort () from /lib/libc.so.6
#3 0xf6fa2726 in zmq::zmq_abort(char const*) () from $TOP/lib/platform/libzmq.so.5
#4 0xf6fa164b in zmq::epoll_t::set_pollout(void*) () from $TOP/lib/platform/libzmq.so.5
#5 0xf6fa3951 in zmq::io_object_t::set_pollout(void*) () from $TOP/lib/platform/libzmq.so.5
#6 0xf6fdafe1 in zmq::stream_engine_t::restart_output() () from $TOP/lib/platform/libzmq.so.5
#7 0xf6fcae20 in zmq::session_base_t::read_activated(zmq::pipe_t*) () from $TOP/lib/platform/libzmq.so.5
#8 0xf6fb9dd3 in zmq::pipe_t::process_activate_read() () from $TOP/lib/platform/libzmq.so.5
#9 0xf6fb2a9e in zmq::object_t::process_command(zmq::command_t&) () from $TOP/lib/platform/libzmq.so.5
#10 0xf6fa3f77 in zmq::io_thread_t::in_event() () from $TOP/lib/platform/libzmq.so.5
#11 0xf6fa1948 in zmq::epoll_t::loop() () from $TOP/lib/platform/libzmq.so.5
#12 0xf6fa1a35 in zmq::epoll_t::worker_routine(void*) () from $TOP/lib/platform/libzmq.so.5
#13 0xf6fe36f2 in thread_routine () from $TOP/lib/platform/libzmq.so.5
#14 0xf7574b2c in start_thread () from /lib/libpthread.so.0
#15 0xf746308e in clone () from /lib/libc.so.6
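The abort in Thread 1 is raised from zmq::epoll_t::set_pollout, and "Bad file descriptor" is the strerror text for EBADF. One way to get EBADF when changing an epoll registration is that the descriptor was closed after it was registered, for example by unrelated code in the same process closing a descriptor number it does not own. That is only one possible cause, not a diagnosis of the program above. A minimal sketch that reproduces the kernel behaviour with Python's select.epoll on Linux follows; the function name is hypothetical, and it is an assumption that libzmq's set_pollout issues the equivalent EPOLL_CTL_MOD call.

```python
import errno
import os
import select


def modify_after_close():
    """Return the errno seen when modifying the registration of a closed fd."""
    ep = select.epoll()
    r, w = os.pipe()
    try:
        ep.register(w, select.EPOLLIN)
        # Simulate other code closing a descriptor that the poller still uses.
        os.close(w)
        try:
            # Equivalent to epoll_ctl(EPOLL_CTL_MOD) adding interest in output.
            ep.modify(w, select.EPOLLIN | select.EPOLLOUT)
        except OSError as exc:
            return exc.errno
        return 0
    finally:
        os.close(r)
        ep.close()


if __name__ == "__main__":
    print(errno.errorcode.get(modify_after_close()))
```

If the descriptor number had been reused by a new open() before the modify call, the kernel would report ENOENT instead, so a stray close() does not always show up as EBADF.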
[zeromq-dev] Jyre on jzmq-api
Hi all,

For the past year or two I've been working on jzmq-api in my spare time, trying to mature it so it can eventually become a 1.0 library. The best way I've found to do this is to actually build things with it. To that end, I've got some ideas for a new library that uses the ZRE protocol, so I decided to refactor the existing Java port of Zyre, called Jyre, from jzmq to jzmq-api.

I noticed that there has continued to be a lot of development on Zyre in C, which is very exciting. Jyre, however, currently supports only the retired ZRE 1.0 protocol, so I also decided to update it to support 2.0 and add some of the features from the C version.

So, a few questions for Java folks. Does anyone use the Jyre project currently? If so, what is the best option for the implementation I'm working on? Should I leave it in parallel to the existing one, in the same project? Should I create a new project for it? Should I split it into a separate module within the existing Jyre project?

I'm looking for the best way to get this out there so people wanting to use Zyre on the JVM can have a pure Java option. No matter what, this will be a good thing for jzmq-api, as it will force us (me at least) to polish jzmq-api up and get a new version out there. Note that jzmq-api currently works with both jzmq and jeromq, so you can still switch between libzmq and the pure Java implementation of ZeroMQ, which is very nice.

You can see the work in progress here: https://github.com/sjohnr/jyre
[zeromq-dev] router socket hangs on write (was detecting dead MDP workers)
Hi,

Continuing my journey on detecting dead workers I reduced the design to the minimum, and eliminated the messy file descriptors.

I only have:
- a router socket, with some number of peers
- a monitor socket attached to the router socket

When the monitor detects a disconnect on the router socket:
- do setsockopt(ZMQ_ROUTER_MANDATORY, 1);
- send heartbeat message to every known peer
- if EHOSTUNREACH returned: remove the peer
- do setsockopt(ZMQ_ROUTER_MANDATORY, 0);

What happens: _my app regularly hangs_ in zmq_msg_send(). Roughly 20% of the invocations. The call never returns, I have to kill the application.

What am I doing wrong??? According to the RFCs router sockets should never block.

I attached a full stacktrace with info locals and args for each relevant frame (sorry for the machine readable format).

Env:
libzmq 4.2.1 stable, debug build
Ubuntu 16.04 64bit (the same happens with ubuntu packaged lib)

Regards,
Gyorgy

<741bt
>&"bt\n"
>~"#0 0x7f827d342e8d in poll () at ../sysdeps/unix/syscall-template.S:84\n"
>~"#1 0x00524c06 in zmq::signaler_t::wait (this=0x2392138, >timeout_=-1) at src/signaler.cpp:228\n"
>~"#2 0x005144d1 in zmq::mailbox_t::recv (this=0x23920d0, >cmd_=0x7fff10f6bf80, timeout_=-1) at src/mailbox.cpp:81\n"
>~"#3 0x0052a443 in zmq::socket_base_t::process_commands >(this=0x2393510, timeout_=-1, throttle_=false) at src/socket_base.cpp:1328\n"
>~"#4 0x00529c1c in zmq::socket_base_t::send (this=0x2393510, >msg_=0x7fff10f6c190, flags_=2) at src/socket_base.cpp:1142\n"
>~"#5 0x00500bf1 in s_sendmsg (s_=0x2393510, msg_=0x7fff10f6c190, >flags_=2) at src/zmq.cpp:375\n"
>~"#6 0x00501aeb in zmq_msg_send (msg_=0x7fff10f6c190, s_=0x2393510, >flags_=2) at src/zmq.cpp:642\n"
>~"#7 0x004c7ede in zmq::socket_t::send (this=0x7fff10f6cba0, >flags_=2, msg_=...) at >/home/twinsen/Git/gehc-av-broker/import/zmq-20160511/include/zmq.hpp:612\n"
>~"#8 zmq::multipart_t::send (this=0x7fff10f6c2d0, socket=...) at >/home/twinsen/Git/gehc-av-broker/import/zmq-20160511/include/zmq_addon.hpp:124\n"
>~"#9 0x004c155c in reqrep::Service::handleWorkerDisconnect >(this=0x7fff10f6cb70, event=, fd=) at >/home/twinsen/Git/gehc-av-broker/src/main/src/reqrepService.cpp:132\n"
>~"#10 0x004d54de in std::function::operator()(int, >int) const (__args#1=46, __args#0=512, this=0x7fff10f6c610) at >/usr/include/c++/5/functional:2267\n"
>~"#11 broker::Monitor::monitorEvent(int, std::__cxx11::basic_stringstd::char_traits, std::allocator > const&, std::function(int, int)>) (this=0x7fff10f6cb70, socketId=, name=..., >callback=...) at >/home/twinsen/Git/gehc-av-broker/src/main/src/monitor.cpp:134\n"
>~"#12 0x004d613b in broker::Monitoroperator() >(__closure=0x23c8d30) at >/home/twinsen/Git/gehc-av-broker/src/main/src/monitor.cpp:69\n"
>~"#13 std::_Function_handlerbroker::Monitor::watchSocket(zmq::socket_t&, const string&, >std::function, int):: >::_M_invoke(const >std::_Any_data &) (__functor=...) at /usr/include/c++/5/functional:1871\n"
>~"#14 0x004d2d84 in std::function::operator()() const >(this=) at /usr/include/c++/5/functional:2267\n"
>~"#15 broker::Reactor::processPollSet (this=this@entry=0x7fff10f6c8b0) at >/home/twinsen/Git/gehc-av-broker/src/main/src/reactor.cpp:210\n"
>~"#16 0x004d2f92 in broker::Reactor::start >(this=this@entry=0x7fff10f6c8b0, testMode=testMode@entry=false) at >/home/twinsen/Git/gehc-av-broker/src/main/src/reactor.cpp:163\n"
>~"#17 0x004780ae in broker::Broker::start >(this=this@entry=0x7fff10f6d78f, reqrepParams=..., pubsubParams=..., >httpParams=...) at >/home/twinsen/Git/gehc-av-broker/src/main/src/broker.cpp:71\n"
>~"#18 0x00472c86 in main (argc=1, argv=0x7fff10f6d960) at >/home/twinsen/Git/gehc-av-broker/src/main/src/main.cpp:30\n"
>741^done
<744frame 1
>&"frame 1\n"
>~"#1 0x00524c06 in zmq::signaler_t::wait (this=0x2392138, >timeout_=-1) at src/signaler.cpp:228\n"
>~"228\tint rc = poll (&pfd, 1, timeout_);\n"
>744^done
<745info locals
>&"info locals\n"
>~"pfd = {fd = 14"
>~", events = 1"
>~", revents = 0"
>~"}"
>~"\n"
>~"rc = 0"
>~"\n"
>745^done
<746info args
>&"info args\n"
>~"this = 0x2392138"
>~"\n"
>~"timeout_ = -1"
>~"\n"
>746^done
<747frame 2
>&"frame 2\n"
>~"#2 0x005144d1 in zmq::mailbox_t::recv (this=0x23920d0, >cmd_=0x7fff10f6bf80, timeout_=-1) at src/mailbox.cpp:81\n"
>~"81\tint rc = signaler.wait (timeout_);\n"
>747^done
<748info locals
>&"info locals\n"
>~"rc = 32767"
>~"\n"
>~"ok = 16"
>~"\n"
>748^done
<749info args
>&"info args\n"
>~"this = 0x23920d0"
>~"\n"
>~"cmd_ = 0x7fff10f6bf80"
>~"\n"
>~"timeout_ = -1"
>~"\n"
>749^done
<750frame 3
>&"frame 3\n"
>~"#3 0x0052a443 in zmq::socket_base_t::process_commands >(this=0x2393510, timeout_=-1, throttle_=false) at src/socket_base.cpp:1328\n"
>~"1328\trc = mailbox->recv (&cmd, timeout_);\n"
>750^done
<751info locals
>&"info loca