Re: [zeromq-dev] need advice! hitting assertion in epoll.cpp:131 (zmq 4.2.1) when going from RHEL6 to RHEL7?!

2017-02-16 Thread zmqdev

On 16.02.2017 16:26, Luca Boccassi wrote:

> What's the file limit on the 2 systems? (With the user that runs the
> program)
>
> ulimit -n



on both 6.8 and 7.3:

development environment: ulimit -n = 1024
  installed environment: ulimit -n = 4096

With a basic sampling of file descriptors

while true; do
	lsof -P -M -l -n -d '^cwd,^err,^ltx,^mem,^mmap,^pd,^rtd,^txt' -p -u $USER -a | wc -l
	sleep 2
done

the total number of file descriptors for $USER increases only by about 600.
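
For a per-process view without lsof, a minimal sketch (assuming Linux with /proc mounted; the function names are made up for illustration) could compare the number of descriptors the process has open with the soft limit that `ulimit -n` reports:

```cpp
#include <cassert>
#include <dirent.h>
#include <sys/resource.h>

// Hypothetical helper: number of descriptors currently open in this process,
// counted from /proc/self/fd (Linux-specific). Returns -1 if /proc is
// unavailable.
long count_open_fds()
{
    DIR *dir = opendir("/proc/self/fd");
    if (dir == nullptr)
        return -1;
    long n = 0;
    while (dirent *entry = readdir(dir)) {
        if (entry->d_name[0] != '.')   // skip "." and ".."
            ++n;
    }
    closedir(dir);
    assert(n > 0);   // the directory stream's own descriptor is always listed
    return n - 1;    // do not count the descriptor opendir() itself used
}

// Hypothetical helper: soft RLIMIT_NOFILE, the value `ulimit -n` prints.
// Returns -1 on failure.
long soft_fd_limit()
{
    struct rlimit rl{};
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    return static_cast<long>(rl.rlim_cur);
}
```

Logging both values from inside the program around startup and shutdown would show how close it gets to the limit at the moments the assertion fires, which a 2-second lsof sample may miss.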


___
zeromq-dev mailing list
zeromq-dev@lists.zeromq.org
https://lists.zeromq.org/mailman/listinfo/zeromq-dev

Re: [zeromq-dev] router socket hangs on write (was detecting dead MDP workers)

2017-02-16 Thread Gyorgy Szekely
Hi,
I dug a bit deeper, here are my findings:
- removing the on/off switching for the ZMQ_ROUTER_MANDATORY flag, and
enabling it before the router socket bind: makes no difference
- removing the monitor trigger and heartbeating the workers periodically
(2.5 sec) drastically reduces the occurrence rate, the program hangs after
3-4 hours, instead of seconds. (in the background a worker
connects/disconnects with 4 second period time)

From this I suspect the issue occurs in a small time window close to the
monitor event, and is otherwise hard to hit.

With GDB I see the following:
- in zmq::socket_base_t::send() the call to xsend() returns EAGAIN. This
should not happen since ZMQ_DONTWAIT is not specified.
- because ZMQ_DONTWAIT is not specified, the function does not return -1 but
blocks (see trace in prev mail).

- inside zmq::router_t::xsend() the pipe is found in the outpipes map, but
the check_write() on it returns false
- the if(mandatory) check in this block (router.cpp:218) returns with -1,
EAGAIN
- a similar block 10 lines below returns with -1, EHOSTUNREACH

Should both if(mandatory) checks return EHOSTUNREACH? There's also a
comment in the header for bool mandatory saying that it will report EAGAIN,
but this contradicts the documentation.

Can you help to clarify?
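
To make the question concrete, here is a small model of the branches described above. It is only a sketch built from the GDB observations in this message, not the libzmq source: the function names are made up, the second branch is assumed to be the "no pipe found for this identity" case, and send timeouts are ignored.

```cpp
#include <cassert>
#include <cerrno>

// Simplified model of the branches described above, NOT the libzmq source.
// Returns 0 when the message is accepted or silently dropped, otherwise
// the errno value reported for that branch.
int router_xsend_model(bool pipe_found, bool pipe_writable, bool mandatory)
{
    if (pipe_found) {
        if (pipe_writable)
            return 0;                     // message goes to the peer's pipe
        return mandatory ? EAGAIN : 0;    // peer known, check_write() failed
    }
    return mandatory ? EHOSTUNREACH : 0;  // assumed: no pipe for this identity
}

// Per the GDB observation: EAGAIN from xsend() makes send() wait unless
// the caller passed ZMQ_DONTWAIT.
bool send_blocks(int xsend_errno, bool dontwait)
{
    return xsend_errno == EAGAIN && !dontwait;
}

// Walks through the combinations discussed in the message.
void check_router_model()
{
    assert(router_xsend_model(true, false, true) == EAGAIN);
    assert(router_xsend_model(false, false, true) == EHOSTUNREACH);
    assert(router_xsend_model(true, false, false) == 0);
    assert(send_blocks(EAGAIN, false));
    assert(!send_blocks(EAGAIN, true));
    assert(!send_blocks(EHOSTUNREACH, false));
}
```

In this model the hang corresponds to the first assertion combined with the fourth: a known peer whose pipe is not writable, a mandatory router, and a send without ZMQ_DONTWAIT.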


Regards,
  Gyorgy



On Thu, Feb 16, 2017 at 12:22 PM, Gyorgy Szekely wrote:

> Hi,
> Continuing my journey on detecting dead workers I reduced the design to
> the minimum, and eliminated the messy file descriptors.
> I only have:
> - a router socket, with some number of peers
> - a monitor socket attached to the router socket
>
> When the monitor detects a disconnect on the router socket:
> - do setsockopt(ZMQ_ROUTER_MANDATORY, 1);
> - send heartbeat message to every known peer
> - if EHOSTUNREACH returned: remove the peer
> - do setsockopt(ZMQ_ROUTER_MANDATORY, 0);
>
> What happens: _my app regularly hangs_ in zmq_msg_send(). Roughly 20% of
> the invocations. The call never returns, I have to kill the application.
>
> What am I doing wrong??? According to the RFCs, router sockets should
> never block.
> I attached a full stacktrace with info locals and args for each relevant
> frame (sorry for the machine readable format).
>
> Env:
> libzmq 4.2.1 stable, debug build
> Ubuntu 16.04 64bit (the same happens with ubuntu packaged lib)
>
> Regards,
>   Gyorgy
>
>

Re: [zeromq-dev] need advice! hitting assertion in epoll.cpp:131 (zmq 4.2.1) when going from RHEL6 to RHEL7?!

2017-02-16 Thread Luca Boccassi
On Thu, 2017-02-16 at 15:54 +0100, zmqdev wrote:
> >
> > Are you building your own binaries in both cases?
> >
> 
> yes
> 
> > What polling mechanism was RHEL 6 using? You can see it in
> > the ./configure output: "Using 'epoll' polling system"
> 
> from config.log:
> 
>   Using 'epoll' polling system with CLOEXEC

What's the file limit on the 2 systems? (With the user that runs the
program)

ulimit -n

Kind regards,
Luca Boccassi



Re: [zeromq-dev] need advice! hitting assertion in epoll.cpp:131 (zmq 4.2.1) when going from RHEL6 to RHEL7?!

2017-02-16 Thread zmqdev


> Are you building your own binaries in both cases?

yes

> What polling mechanism was RHEL 6 using? You can see it in
> the ./configure output: "Using 'epoll' polling system"

from config.log:

  Using 'epoll' polling system with CLOEXEC




Re: [zeromq-dev] need advice! hitting assertion in epoll.cpp:131 (zmq 4.2.1) when going from RHEL6 to RHEL7?!

2017-02-16 Thread Luca Boccassi
On Thu, 2017-02-16 at 14:59 +0100, zmqdev wrote:
> Hello,
> 
> I could use some advice to diagnose the following issue.
> 
> I have a program that has been running without problems for a couple of 
> years on Red Hat Enterprise Linux 6 at various sites.
> 
> On RHEL7, the program triggers the assertion
> 
>   Bad file descriptor (src/epoll.cpp:131)
> 
> in about 1/3 of executions, during startup (sometimes during shutdown).
> 
> Less often, I see
> 
>   Bad file descriptor (src/epoll.cpp:100)
> 
> The problem persists after upgrading to ZeroMQ 4.2.1 from 4.1.6.
> 
> I don't get it!
> 
> Programming errors aside, I do check all return codes and log errors as 
> they occur in the main thread, and there is nothing until libzmq commits 
> suicide from one of its threads.
> 
> Any idea/advice on how I could track down this problem?
> 
> What makes RHEL7 different enough from RHEL6 to cause this kind of error?
> 
> Cheers :-(
> 
> 
> GDB BACKTRACE FROM CORE FILE:
> 
> Thread 3 (Thread 0xf736b900 (LWP 5039)):
> #0  0xf7751430 in __kernel_vsyscall ()
> #1  0xf745694b in poll () from /lib/libc.so.6
> #2  0xf6ff5457 in 
> zmq::socket_poller_t::wait(zmq::socket_poller_t::event_t*, int, long) () 
> from $TOP/lib/platform/libzmq.so.5
> #3  0xf6ff325f in zmq_poller_wait_all(void*, zmq_poller_event_t*, int, 
> long) () from $TOP/lib/platform/libzmq.so.5
> #4  0xf6ff3aa5 in zmq_poller_poll(zmq_pollitem_t*, int, long) () from 
> $TOP/lib/platform/libzmq.so.5
> #5  0xf6ff2bb1 in zmq_poll () from $TOP/lib/platform/libzmq.so.5
> #6  0xf702cec1 in zt_reactor_loop (r=) at 
> $TOP/src/reactor.c:268
> (...)
> #17 0x080487da in main ()
> 
> Thread 2 (Thread 0xf6e6db40 (LWP 5066)):
> #0  0xf7751430 in __kernel_vsyscall ()
> #1  0xf7463a16 in epoll_wait () from /lib/libc.so.6
> #2  0xf6fa17d0 in zmq::epoll_t::loop() () from $TOP/lib/platform/libzmq.so.5
> #3  0xf6fa1a35 in zmq::epoll_t::worker_routine(void*) () from 
> $TOP/lib/platform/libzmq.so.5
> #4  0xf6fe36f2 in thread_routine () from $TOP/lib/platform/libzmq.so.5
> #5  0xf7574b2c in start_thread () from /lib/libpthread.so.0
> #6  0xf746308e in clone () from /lib/libc.so.6
> 
> Thread 1 (Thread 0xf666cb40 (LWP 5067)):
> #0  0xf7751430 in __kernel_vsyscall ()
> #1  0xf739a1f7 in raise () from /lib/libc.so.6
> #2  0xf739ba33 in abort () from /lib/libc.so.6
> #3  0xf6fa2726 in zmq::zmq_abort(char const*) () from 
> $TOP/lib/platform/libzmq.so.5
> #4  0xf6fa164b in zmq::epoll_t::set_pollout(void*) () from 
> $TOP/lib/platform/libzmq.so.5
> #5  0xf6fa3951 in zmq::io_object_t::set_pollout(void*) () from 
> $TOP/lib/platform/libzmq.so.5
> #6  0xf6fdafe1 in zmq::stream_engine_t::restart_output() () from 
> $TOP/lib/platform/libzmq.so.5
> #7  0xf6fcae20 in zmq::session_base_t::read_activated(zmq::pipe_t*) () 
> from $TOP/lib/platform/libzmq.so.5
> #8  0xf6fb9dd3 in zmq::pipe_t::process_activate_read() () from 
> $TOP/lib/platform/libzmq.so.5
> #9  0xf6fb2a9e in zmq::object_t::process_command(zmq::command_t&) () 
> from $TOP/lib/platform/libzmq.so.5
> #10 0xf6fa3f77 in zmq::io_thread_t::in_event() () from 
> $TOP/lib/platform/libzmq.so.5
> #11 0xf6fa1948 in zmq::epoll_t::loop() () from $TOP/lib/platform/libzmq.so.5
> #12 0xf6fa1a35 in zmq::epoll_t::worker_routine(void*) () from 
> $TOP/lib/platform/libzmq.so.5
> #13 0xf6fe36f2 in thread_routine () from $TOP/lib/platform/libzmq.so.5
> #14 0xf7574b2c in start_thread () from /lib/libpthread.so.0
> #15 0xf746308e in clone () from /lib/libc.so.6

Are you building your own binaries in both cases?

What polling mechanism was RHEL 6 using? You can see it in
the ./configure output: "Using 'epoll' polling system"

Kind regards,
Luca Boccassi



[zeromq-dev] need advice! hitting assertion in epoll.cpp:131 (zmq 4.2.1) when going from RHEL6 to RHEL7?!

2017-02-16 Thread zmqdev

Hello,

I could use some advice to diagnose the following issue.

I have a program that has been running without problems for a couple of 
years on Red Hat Enterprise Linux 6 at various sites.


On RHEL7, the program triggers the assertion

Bad file descriptor (src/epoll.cpp:131)

in about 1/3 of executions, during startup (sometimes during shutdown).

Less often, I see

Bad file descriptor (src/epoll.cpp:100)

The problem persists after upgrading to ZeroMQ 4.2.1 from 4.1.6.

I don't get it!

Programming errors aside, I do check all return codes and log errors as 
they occur in the main thread, and there is nothing until libzmq commits 
suicide from one of its threads.


Any idea/advice on how I could track down this problem?

What makes RHEL7 different enough from RHEL6 to cause this kind of error?

Cheers :-(


GDB BACKTRACE FROM CORE FILE:

Thread 3 (Thread 0xf736b900 (LWP 5039)):
#0  0xf7751430 in __kernel_vsyscall ()
#1  0xf745694b in poll () from /lib/libc.so.6
#2  0xf6ff5457 in zmq::socket_poller_t::wait(zmq::socket_poller_t::event_t*, int, long) () from $TOP/lib/platform/libzmq.so.5
#3  0xf6ff325f in zmq_poller_wait_all(void*, zmq_poller_event_t*, int, long) () from $TOP/lib/platform/libzmq.so.5
#4  0xf6ff3aa5 in zmq_poller_poll(zmq_pollitem_t*, int, long) () from $TOP/lib/platform/libzmq.so.5
#5  0xf6ff2bb1 in zmq_poll () from $TOP/lib/platform/libzmq.so.5
#6  0xf702cec1 in zt_reactor_loop (r=) at $TOP/src/reactor.c:268
(...)
#17 0x080487da in main ()

Thread 2 (Thread 0xf6e6db40 (LWP 5066)):
#0  0xf7751430 in __kernel_vsyscall ()
#1  0xf7463a16 in epoll_wait () from /lib/libc.so.6
#2  0xf6fa17d0 in zmq::epoll_t::loop() () from $TOP/lib/platform/libzmq.so.5
#3  0xf6fa1a35 in zmq::epoll_t::worker_routine(void*) () from $TOP/lib/platform/libzmq.so.5
#4  0xf6fe36f2 in thread_routine () from $TOP/lib/platform/libzmq.so.5
#5  0xf7574b2c in start_thread () from /lib/libpthread.so.0
#6  0xf746308e in clone () from /lib/libc.so.6

Thread 1 (Thread 0xf666cb40 (LWP 5067)):
#0  0xf7751430 in __kernel_vsyscall ()
#1  0xf739a1f7 in raise () from /lib/libc.so.6
#2  0xf739ba33 in abort () from /lib/libc.so.6
#3  0xf6fa2726 in zmq::zmq_abort(char const*) () from $TOP/lib/platform/libzmq.so.5
#4  0xf6fa164b in zmq::epoll_t::set_pollout(void*) () from $TOP/lib/platform/libzmq.so.5
#5  0xf6fa3951 in zmq::io_object_t::set_pollout(void*) () from $TOP/lib/platform/libzmq.so.5
#6  0xf6fdafe1 in zmq::stream_engine_t::restart_output() () from $TOP/lib/platform/libzmq.so.5
#7  0xf6fcae20 in zmq::session_base_t::read_activated(zmq::pipe_t*) () from $TOP/lib/platform/libzmq.so.5
#8  0xf6fb9dd3 in zmq::pipe_t::process_activate_read() () from $TOP/lib/platform/libzmq.so.5
#9  0xf6fb2a9e in zmq::object_t::process_command(zmq::command_t&) () from $TOP/lib/platform/libzmq.so.5
#10 0xf6fa3f77 in zmq::io_thread_t::in_event() () from $TOP/lib/platform/libzmq.so.5
#11 0xf6fa1948 in zmq::epoll_t::loop() () from $TOP/lib/platform/libzmq.so.5
#12 0xf6fa1a35 in zmq::epoll_t::worker_routine(void*) () from $TOP/lib/platform/libzmq.so.5
#13 0xf6fe36f2 in thread_routine () from $TOP/lib/platform/libzmq.so.5
#14 0xf7574b2c in start_thread () from /lib/libpthread.so.0
#15 0xf746308e in clone () from /lib/libc.so.6
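
Thread 1 in the backtrace above aborts inside zmq::epoll_t::set_pollout, and the reported error is "Bad file descriptor", i.e. EBADF. One way to get EBADF from epoll (an assumption for illustration, not a diagnosis of this program) is to modify the registration of a descriptor that some other code has already closed. A minimal Linux-only sketch of that effect, with an eventfd standing in for a socket and made-up function names:

```cpp
#include <cassert>
#include <cerrno>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

// Registers a descriptor with epoll, closes it behind epoll's back, then
// tries to enable EPOLLOUT on it. Returns the errno of the failing
// EPOLL_CTL_MOD call, 0 if the call unexpectedly succeeds, or the errno of
// a failed setup step.
int errno_from_mod_on_closed_fd()
{
    int ep = epoll_create1(EPOLL_CLOEXEC);
    if (ep == -1)
        return errno;

    int fd = eventfd(0, EFD_CLOEXEC);   // stand-in for a connection socket
    if (fd == -1) {
        int saved = errno;
        close(ep);
        return saved;
    }

    epoll_event ev{};
    ev.events = EPOLLIN;
    ev.data.fd = fd;
    if (epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev) == -1) {
        int saved = errno;
        close(fd);
        close(ep);
        return saved;
    }

    close(fd);   // simulates unrelated code closing a descriptor it does not own

    ev.events = EPOLLIN | EPOLLOUT;
    int rc = epoll_ctl(ep, EPOLL_CTL_MOD, fd, &ev);
    int result = (rc == -1) ? errno : 0;
    close(ep);
    return result;
}

// Expected outcome on Linux when nothing reuses the descriptor number.
void check_closed_fd_gives_ebadf()
{
    assert(errno_from_mod_on_closed_fd() == EBADF);
}
```

If the descriptor number had been reused by another open descriptor before the modification, the call would be expected to fail with ENOENT instead, because the new descriptor is not registered with that epoll instance.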



[zeromq-dev] Jyre on jzmq-api

2017-02-16 Thread Stephen Riesenberg
Hi all,

For the past year or two I've been working on jzmq-api in my spare time,
trying to mature it so it can eventually become a 1.0 library. The best way
I've found to do this is to actually build things with it. To that end, I've
got some ideas for a new library that uses the ZRE protocol, so I decided
to refactor the existing Java port of Zyre, called Jyre, from jzmq to
jzmq-api. I noticed that there has continued to be a lot of development on
Zyre in C, which is very exciting. And Jyre currently only supports the
retired ZRE 1.0 protocol. So I also decided to update it to support 2.0 and
add some of the features from the C version.

So, a few questions for Java folks. Does anyone use the Jyre project
currently? If so, what is the best option for the implementation I'm
working on? Should I leave it in parallel to the existing implementation, in
the same project? Should I create a new project for it? Should I split it into a
separate module within the existing Jyre project? I'm looking for the best
way to get this out there so people wanting to use Zyre on the JVM can have
a pure Java option.

No matter what, this will be a good thing for jzmq-api, as it will force us
(me at least) to polish jzmq-api up and get a new version out there. Note
that jzmq-api currently works with both jzmq and jeromq, so you can still
switch between libzmq and the pure Java implementation of ZeroMQ, which is
very nice.

You can see the work in progress here: https://github.com/sjohnr/jyre

[zeromq-dev] router socket hangs on write (was detecting dead MDP workers)

2017-02-16 Thread Gyorgy Szekely
Hi,
Continuing my journey on detecting dead workers I reduced the design to the
minimum, and eliminated the messy file descriptors.
I only have:
- a router socket, with some number of peers
- a monitor socket attached to the router socket

When the monitor detects a disconnect on the router socket:
- do setsockopt(ZMQ_ROUTER_MANDATORY, 1);
- send heartbeat message to every known peer
- if EHOSTUNREACH returned: remove the peer
- do setsockopt(ZMQ_ROUTER_MANDATORY, 0);

What happens: _my app regularly hangs_ in zmq_msg_send(). Roughly 20% of
the invocations. The call never returns, I have to kill the application.
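
The pruning steps above can be sketched independently of libzmq. This is a hedged illustration only: `TrySend` and `prune_unreachable` are hypothetical names, and the callback is assumed to wrap a send made with ZMQ_DONTWAIT on a router socket that has ZMQ_ROUTER_MANDATORY set, so that it returns an errno value instead of blocking.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <functional>
#include <set>
#include <string>

// Hypothetical callback: attempts a non-blocking heartbeat send to `peer`
// and returns 0 on success or the errno value on failure.
using TrySend = std::function<int(const std::string &peer)>;

// Removes every peer for which the send reports EHOSTUNREACH and returns
// how many were removed. Any other result, including EAGAIN, keeps the peer.
std::size_t prune_unreachable(std::set<std::string> &peers,
                              const TrySend &try_send)
{
    std::size_t removed = 0;
    for (auto it = peers.begin(); it != peers.end();) {
        if (try_send(*it) == EHOSTUNREACH) {
            it = peers.erase(it);   // erase returns the next valid iterator
            ++removed;
        } else {
            ++it;
        }
    }
    return removed;
}

// Example with a fake send: "b" is gone, "c" is merely busy.
void check_prune_unreachable()
{
    std::set<std::string> peers{"a", "b", "c"};
    TrySend fake = [](const std::string &peer) {
        if (peer == "b") return EHOSTUNREACH;
        if (peer == "c") return EAGAIN;
        return 0;
    };
    assert(prune_unreachable(peers, fake) == 1);
    assert(peers.count("b") == 0);
    assert(peers.size() == 2);
}
```

Keeping the send behind a callback is a design choice made here so the pruning logic can be exercised without a live socket; whether EAGAIN should count as "alive" is a policy decision for the application.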

What am I doing wrong??? According to the RFCs, router sockets should never
block.
I attached a full stacktrace with info locals and args for each relevant
frame (sorry for the machine readable format).

Env:
libzmq 4.2.1 stable, debug build
Ubuntu 16.04 64bit (the same happens with ubuntu packaged lib)

Regards,
  Gyorgy

<741bt
>&"bt\n"
>~"#0  0x7f827d342e8d in poll () at ../sysdeps/unix/syscall-template.S:84\n"
>~"#1  0x00524c06 in zmq::signaler_t::wait (this=0x2392138, 
>timeout_=-1) at src/signaler.cpp:228\n"
>~"#2  0x005144d1 in zmq::mailbox_t::recv (this=0x23920d0, 
>cmd_=0x7fff10f6bf80, timeout_=-1) at src/mailbox.cpp:81\n"
>~"#3  0x0052a443 in zmq::socket_base_t::process_commands 
>(this=0x2393510, timeout_=-1, throttle_=false) at src/socket_base.cpp:1328\n"
>~"#4  0x00529c1c in zmq::socket_base_t::send (this=0x2393510, 
>msg_=0x7fff10f6c190, flags_=2) at src/socket_base.cpp:1142\n"
>~"#5  0x00500bf1 in s_sendmsg (s_=0x2393510, msg_=0x7fff10f6c190, 
>flags_=2) at src/zmq.cpp:375\n"
>~"#6  0x00501aeb in zmq_msg_send (msg_=0x7fff10f6c190, s_=0x2393510, 
>flags_=2) at src/zmq.cpp:642\n"
>~"#7  0x004c7ede in zmq::socket_t::send (this=0x7fff10f6cba0, 
>flags_=2, msg_=...) at 
>/home/twinsen/Git/gehc-av-broker/import/zmq-20160511/include/zmq.hpp:612\n"
>~"#8  zmq::multipart_t::send (this=0x7fff10f6c2d0, socket=...) at 
>/home/twinsen/Git/gehc-av-broker/import/zmq-20160511/include/zmq_addon.hpp:124\n"
>~"#9  0x004c155c in reqrep::Service::handleWorkerDisconnect 
>(this=0x7fff10f6cb70, event=, fd=) at 
>/home/twinsen/Git/gehc-av-broker/src/main/src/reqrepService.cpp:132\n"
>~"#10 0x004d54de in std::function::operator()(int, 
>int) const (__args#1=46, __args#0=512, this=0x7fff10f6c610) at 
>/usr/include/c++/5/functional:2267\n"
>~"#11 broker::Monitor::monitorEvent(int, std::__cxx11::basic_stringstd::char_traits, std::allocator > const&, std::function(int, int)>) (this=0x7fff10f6cb70, socketId=, name=..., 
>callback=...) at 
>/home/twinsen/Git/gehc-av-broker/src/main/src/monitor.cpp:134\n"
>~"#12 0x004d613b in broker::Monitoroperator() 
>(__closure=0x23c8d30) at 
>/home/twinsen/Git/gehc-av-broker/src/main/src/monitor.cpp:69\n"
>~"#13 std::_Function_handlerbroker::Monitor::watchSocket(zmq::socket_t&, const string&, 
>std::function, int):: >::_M_invoke(const 
>std::_Any_data &) (__functor=...) at /usr/include/c++/5/functional:1871\n"
>~"#14 0x004d2d84 in std::function::operator()() const 
>(this=) at /usr/include/c++/5/functional:2267\n"
>~"#15 broker::Reactor::processPollSet (this=this@entry=0x7fff10f6c8b0) at 
>/home/twinsen/Git/gehc-av-broker/src/main/src/reactor.cpp:210\n"
>~"#16 0x004d2f92 in broker::Reactor::start 
>(this=this@entry=0x7fff10f6c8b0, testMode=testMode@entry=false) at 
>/home/twinsen/Git/gehc-av-broker/src/main/src/reactor.cpp:163\n"
>~"#17 0x004780ae in broker::Broker::start 
>(this=this@entry=0x7fff10f6d78f, reqrepParams=..., pubsubParams=..., 
>httpParams=...) at 
>/home/twinsen/Git/gehc-av-broker/src/main/src/broker.cpp:71\n"
>~"#18 0x00472c86 in main (argc=1, argv=0x7fff10f6d960) at 
>/home/twinsen/Git/gehc-av-broker/src/main/src/main.cpp:30\n"
>741^done


<744frame 1
>&"frame 1\n"
>~"#1  0x00524c06 in zmq::signaler_t::wait (this=0x2392138, 
>timeout_=-1) at src/signaler.cpp:228\n"
>~"228\tint rc = poll (&pfd, 1, timeout_);\n"
>744^done


<745info locals
>&"info locals\n"
>~"pfd = {fd = 14"
>~", events = 1"
>~", revents = 0"
>~"}"
>~"\n"
>~"rc = 0"
>~"\n"
>745^done


<746info args
>&"info args\n"
>~"this = 0x2392138"
>~"\n"
>~"timeout_ = -1"
>~"\n"
>746^done


<747frame 2
>&"frame 2\n"
>~"#2  0x005144d1 in zmq::mailbox_t::recv (this=0x23920d0, 
>cmd_=0x7fff10f6bf80, timeout_=-1) at src/mailbox.cpp:81\n"
>~"81\tint rc = signaler.wait (timeout_);\n"
>747^done


<748info locals
>&"info locals\n"
>~"rc = 32767"
>~"\n"
>~"ok = 16"
>~"\n"
>748^done


<749info args
>&"info args\n"
>~"this = 0x23920d0"
>~"\n"
>~"cmd_ = 0x7fff10f6bf80"
>~"\n"
>~"timeout_ = -1"
>~"\n"
>749^done


<750frame 3
>&"frame 3\n"
>~"#3  0x0052a443 in zmq::socket_base_t::process_commands 
>(this=0x2393510, timeout_=-1, throttle_=false) at src/socket_base.cpp:1328\n"
>~"1328\trc = mailbox->recv (&cmd, timeout_);\n"
>750^done


<751info locals
>&"info loca