Re: [zeromq-dev] need advice! hitting assertion in epoll.cpp:131 (zmq 4.2.1) when going from RHEL6 to RHEL7?!

2017-02-16 Thread zmqdev

On 16.02.2017 16:26, Luca Boccassi wrote:

What's the file limit on the 2 systems? (With the user that runs the
program)

ulimit -n



on both 6.8 and 7.3:

development environment: ulimit -n = 1024
  installed environment: ulimit -n = 4096

With a basic sampling of file descriptors

while true; do
	lsof -P -M -l -n -d '^cwd,^err,^ltx,^mem,^mmap,^pd,^rtd,^txt' -p -u $USER -a | wc -l
	sleep 2
done

the total number of file descriptors for $USER increases only by about 600.
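
For comparison with the lsof sampling, descriptor usage can also be measured from inside the process. The following is a minimal sketch, assuming Linux with /proc mounted; the function names `count_open_fds`, `fd_soft_limit` and `log_fd_usage` are hypothetical and not part of libzmq or the program discussed here.

```c
#include <dirent.h>
#include <stdio.h>
#include <sys/resource.h>

/* Count the descriptors currently open in this process by listing
   /proc/self/fd (Linux-specific). Returns -1 if /proc is unavailable. */
int count_open_fds(void)
{
    DIR *dir = opendir("/proc/self/fd");
    if (dir == NULL)
        return -1;

    int count = 0;
    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        if (entry->d_name[0] != '.')    /* skip "." and ".." */
            count++;
    }
    closedir(dir);
    return count - 1;   /* exclude the descriptor opendir() itself held */
}

/* Soft limit on open descriptors, the value that `ulimit -n` reports. */
long fd_soft_limit(void)
{
    struct rlimit lim;
    if (getrlimit(RLIMIT_NOFILE, &lim) != 0)
        return -1;
    return (long) lim.rlim_cur;
}

/* Print current usage against the limit, e.g. at startup and shutdown. */
void log_fd_usage(FILE *out)
{
    fprintf(out, "open fds: %d of %ld\n", count_open_fds(), fd_soft_limit());
}
```

Calling `log_fd_usage(stderr)` around the startup phase would show whether the process itself gets anywhere near the limit, independently of what other processes of the same user hold.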


___
zeromq-dev mailing list
zeromq-dev@lists.zeromq.org
https://lists.zeromq.org/mailman/listinfo/zeromq-dev

Re: [zeromq-dev] need advice! hitting assertion in epoll.cpp:131 (zmq 4.2.1) when going from RHEL6 to RHEL7?!

2017-02-16 Thread zmqdev


Are you building your own binaries in both cases?



yes


What polling mechanism was RHEL 6 using? You can see it in
the ./configure output: "Using 'epoll' polling system"


from config.log:

Using 'epoll' polling system with CLOEXEC




[zeromq-dev] need advice! hitting assertion in epoll.cpp:131 (zmq 4.2.1) when going from RHEL6 to RHEL7?!

2017-02-16 Thread zmqdev

Hello,

I could use some advice to diagnose the following issue.

I have a program that has been running without problems for a couple of 
years on Red Hat Enterprise Linux 6 at various sites.


On RHEL7, the program triggers the assertion

Bad file descriptor (src/epoll.cpp:131)

in about 1/3 of executions, during startup (sometimes during shutdown).

Less often, I see

Bad file descriptor (src/epoll.cpp:100)

The problem persists after upgrading to ZeroMQ 4.2.1 from 4.1.6.

I don't get it!

Programming errors aside, I do check all return codes and log errors as 
they occur in the main thread, and there is nothing until libzmq commits 
suicide from one of its threads.


Any idea/advice on how I could track down this problem?

What makes RHEL7 different enough from RHEL6 to cause this kind of error?

Cheers :-(


GDB BACKTRACE FROM CORE FILE:

Thread 3 (Thread 0xf736b900 (LWP 5039)):
#0  0xf7751430 in __kernel_vsyscall ()
#1  0xf745694b in poll () from /lib/libc.so.6
#2  0xf6ff5457 in zmq::socket_poller_t::wait(zmq::socket_poller_t::event_t*, int, long) () from $TOP/lib/platform/libzmq.so.5
#3  0xf6ff325f in zmq_poller_wait_all(void*, zmq_poller_event_t*, int, long) () from $TOP/lib/platform/libzmq.so.5
#4  0xf6ff3aa5 in zmq_poller_poll(zmq_pollitem_t*, int, long) () from $TOP/lib/platform/libzmq.so.5
#5  0xf6ff2bb1 in zmq_poll () from $TOP/lib/platform/libzmq.so.5
#6  0xf702cec1 in zt_reactor_loop (r=) at $TOP/src/reactor.c:268

(...)
#17 0x080487da in main ()

Thread 2 (Thread 0xf6e6db40 (LWP 5066)):
#0  0xf7751430 in __kernel_vsyscall ()
#1  0xf7463a16 in epoll_wait () from /lib/libc.so.6
#2  0xf6fa17d0 in zmq::epoll_t::loop() () from $TOP/lib/platform/libzmq.so.5
#3  0xf6fa1a35 in zmq::epoll_t::worker_routine(void*) () from $TOP/lib/platform/libzmq.so.5
#4  0xf6fe36f2 in thread_routine () from $TOP/lib/platform/libzmq.so.5
#5  0xf7574b2c in start_thread () from /lib/libpthread.so.0
#6  0xf746308e in clone () from /lib/libc.so.6

Thread 1 (Thread 0xf666cb40 (LWP 5067)):
#0  0xf7751430 in __kernel_vsyscall ()
#1  0xf739a1f7 in raise () from /lib/libc.so.6
#2  0xf739ba33 in abort () from /lib/libc.so.6
#3  0xf6fa2726 in zmq::zmq_abort(char const*) () from $TOP/lib/platform/libzmq.so.5
#4  0xf6fa164b in zmq::epoll_t::set_pollout(void*) () from $TOP/lib/platform/libzmq.so.5
#5  0xf6fa3951 in zmq::io_object_t::set_pollout(void*) () from $TOP/lib/platform/libzmq.so.5
#6  0xf6fdafe1 in zmq::stream_engine_t::restart_output() () from $TOP/lib/platform/libzmq.so.5
#7  0xf6fcae20 in zmq::session_base_t::read_activated(zmq::pipe_t*) () from $TOP/lib/platform/libzmq.so.5
#8  0xf6fb9dd3 in zmq::pipe_t::process_activate_read() () from $TOP/lib/platform/libzmq.so.5
#9  0xf6fb2a9e in zmq::object_t::process_command(zmq::command_t&) () from $TOP/lib/platform/libzmq.so.5
#10 0xf6fa3f77 in zmq::io_thread_t::in_event() () from $TOP/lib/platform/libzmq.so.5
#11 0xf6fa1948 in zmq::epoll_t::loop() () from $TOP/lib/platform/libzmq.so.5
#12 0xf6fa1a35 in zmq::epoll_t::worker_routine(void*) () from $TOP/lib/platform/libzmq.so.5
#13 0xf6fe36f2 in thread_routine () from $TOP/lib/platform/libzmq.so.5
#14 0xf7574b2c in start_thread () from /lib/libpthread.so.0
#15 0xf746308e in clone () from /lib/libc.so.6



[zeromq-dev] setting the CLOEXEC flag for eventfd and epoll instances?

2016-12-02 Thread zmqdev
While investigating a problem involving fork() and ZeroMQ, I found some 
file descriptor leaks.



1. The function

zmq::epoll_t::epoll_t (const zmq::ctx_t &ctx_)

in src/epoll.cpp creates an epoll instance with

epoll_fd = epoll_create(1);

SUGGESTION: replace with

epoll_fd = epoll_create1(EPOLL_CLOEXEC);


2. The function

zmq::signaler_t::make_fdpair (fd_t *r_, fd_t *w_)

in src/signaler.cpp creates an eventfd object with

#if defined ZMQ_HAVE_EVENTFD
fd_t fd = eventfd (0, 0);

SUGGESTION: replace with

#if defined ZMQ_HAVE_EVENTFD
fd_t fd = eventfd (0, EFD_CLOEXEC);
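
The effect of the two suggested flags can be checked outside libzmq. The following is a minimal sketch, assuming Linux; the helper names (`has_cloexec`, `make_epoll_cloexec`, `make_eventfd_cloexec`, `set_cloexec`) are hypothetical and only the system calls themselves are real.

```c
#include <fcntl.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Return 1 if fd has FD_CLOEXEC set, 0 if not, -1 on error. */
int has_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1)
        return -1;
    return (flags & FD_CLOEXEC) ? 1 : 0;
}

/* Create an epoll instance that is closed automatically on exec. */
int make_epoll_cloexec(void)
{
    return epoll_create1(EPOLL_CLOEXEC);
}

/* Create an eventfd object that is closed automatically on exec. */
int make_eventfd_cloexec(void)
{
    return eventfd(0, EFD_CLOEXEC);
}

/* Fallback for a descriptor created without the flag. Unlike the two
   calls above this is not atomic: another thread may fork and exec
   between the creation of the descriptor and this call. */
int set_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC);
}
```

Note that the flag only takes effect at exec: between fork() and exec the child still holds copies of these descriptors.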




Re: [zeromq-dev] What is the canonical handling of zeromq sockets when fork+exec?

2016-11-25 Thread zmqdev

On 25.11.2016 11:50, Luca Boccassi wrote:

What I can say is that we have a unit test for this situation:

https://github.com/zeromq/libzmq/blob/master/tests/test_fork.cpp

And the child closes the (TCP) socket explicitly before the context.
Which is in fact what should happen in all cases.

The parent then can receive messages on the sockets just fine.

Maybe it's a linger issue? By default a socket has 30s of linger grace
period.

Try setting ZMQ_LINGER to 0 in the socket in the child, close the socket
and then terminate the context perhaps.


Thanks. To restate the sequence:

1. zmq_close sockets in child (perhaps setting ZMQ_LINGER to 0 
beforehand)
2. zmq_term context in child

and only then

3. close rest of file descriptors in child

The reason I went directly to point 3 is this line from the man page of 
fork(2):


The child process is created with a single thread—the one
that called fork().

(see http://man7.org/linux/man-pages/man2/fork.2.html)

Michael Kerrisk in "The Linux Programming Interface" insists:

When a multithreaded process calls fork(), only the calling thread is
replicated in the child process. (The ID of the thread in the child is the
same as the ID of the thread that called fork() in the parent.) All of the
other threads vanish in the child; no thread-specific data destructors or
cleanup handlers are executed for those threads.
(...)

Of course, that's where I run into the problem?!



[zeromq-dev] What is the canonical handling of zeromq sockets when fork+exec?

2016-11-25 Thread zmqdev


* Background

I have a service that starts workers on demand with fork+exec.
The requests arrive over zeromq sockets.

After the fork, before the exec, I close all file descriptors > 2, 
keeping only stdin/out/err. I then exec the requested program.
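
The fork, close, exec sequence described above can be sketched as follows. This is a minimal illustration, not the service's actual code: the names `close_fd_range`, `spawn_worker` and `wait_for_worker` are hypothetical, and the upper bound `max_fd` is left to the caller as an assumption (for example the soft `RLIMIT_NOFILE` value minus one).

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Close every descriptor in [first, last]. Errors such as EBADF for
   descriptors that are not open are deliberately ignored. */
void close_fd_range(int first, int last)
{
    for (int fd = first; fd <= last; fd++)
        close(fd);
}

/* fork, close everything above stderr in the child, then exec argv[0].
   Returns the child's pid in the parent, or -1 if fork failed. */
pid_t spawn_worker(char *const argv[], int max_fd)
{
    pid_t pid = fork();
    if (pid != 0)
        return pid;             /* parent, or fork failure (-1) */

    /* Child: only the forking thread exists here, so the code restricts
       itself to simple system calls until exec. */
    close_fd_range(3, max_fd);
    execv(argv[0], argv);
    _exit(127);                 /* reached only if exec failed */
}

/* Wait for the child; return its exit status, or -1 if it did not
   exit normally. */
int wait_for_worker(pid_t pid)
{
    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

After fork() the child has its own copy of the descriptor table, so the close() calls in the child do not close the parent's descriptors. Whether they can still disturb libzmq in the parent through shared kernel objects is the open question of this post, and the sketch does not settle it.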



* Problem

It works. Except that I get some rare core dumps (of the service) with 
the following assertion failure:


Bad file descriptor (src/epoll.cpp:90)

and the backtrace:

#0  0xf77f5430 in __kernel_vsyscall ()
#1  0xf743f1f7 in raise () from /lib/libc.so.6
#2  0xf7440a33 in abort () from /lib/libc.so.6
#3  0xf7067134 in zmq::zmq_abort(char const*) () from $LIBS/libzmq.so.5
#4  0xf7065e6c in zmq::epoll_t::rm_fd(void*) () from $LIBS/libzmq.so.5
#5  0xf7068823 in zmq::io_object_t::rm_fd(void*) () from $LIBS/libzmq.so.5
#6  0xf70958af in zmq::stream_engine_t::unplug() () from $LIBS/libzmq.so.5
#7  0xf7098711 in zmq::stream_engine_t::error(zmq::stream_engine_t::error_reason_t) () from $LIBS/libzmq.so.5
#8  0xf7098867 in zmq::stream_engine_t::timer_event(int) () from $LIBS/libzmq.so.5
#9  0xf707f972 in zmq::poller_base_t::execute_timers() () from $LIBS/libzmq.so.5
#10 0xf7066209 in zmq::epoll_t::loop() () from $LIBS/libzmq.so.5
#11 0xf7066467 in zmq::epoll_t::worker_routine(void*) () from $LIBS/libzmq.so.5
#12 0xf709d67e in thread_routine () from $LIBS/libzmq.so.5
#13 0xf7619b2c in start_thread () from /lib/libpthread.so.0
#14 0xf750808e in clone () from /lib/libc.so.6

This is with zeromq-4.1.4 on RHEL 7.3 x86_64.

So I wonder: is there some interaction between parent and child?


* Documentation

The Guide and the FAQ do not explicitly address fork+exec.

The question has been asked several times on the mailing list in various 
forms, without a definitive answer (for dummies like me at least).



* Questions:

Do I need to zmq_close the sockets in the child?
Or is zmq_term in the child enough?
Does closing the file descriptors in the child cause problems in the parent?

What is the correct way to handle this?



Re: [zeromq-dev] Assertion failed: !more (src/fq.cpp:117) = fix works!

2016-10-21 Thread zmqdev

Hi Doron,

I tried out your latest repo


https://github.com/somdoron/libzmq/commit/3775d0853a8c1f1c3854a94c7fe12e78046faeca

with the changes to src/socket_base.cpp, src/pipe.cpp and src/pipe.hpp.

I confirm that the problem reported at

https://github.com/zeromq/libzmq/issues/2163

is solved: the assertion in src/fq.cpp:118 is not triggered anymore.

Thanks!

There is one point I'm unsure about. The documentation of zmq_disconnect 
states


Any outstanding messages physically received from the network but not
yet received by the application with _zmq_recv()_ shall be discarded.

https://github.com/zeromq/libzmq/blob/master/doc/zmq_disconnect.txt#L19-L20

In the PUB->SUB test case, the message is not discarded by the 
zmq_disconnect, and is received (sans abort) with zmq_msg_recv.


Doesn't this behavior contradict the doc?


Re: [zeromq-dev] about Assertion failed: !more (src/fq.cpp:117): just remove the assertion?

2016-10-20 Thread zmqdev

On 20.10.2016 15:46, Doron Somech wrote:

I actually think those are different issues. If you suffer from the pubsub
issue, I think I traced the bug and have a solution. Take a look at the
issue. If you suffer from the 100K issue, I think that is a different
one; anyway, you can try the solution as well. It might be related to
another pipe.terminate call.



I just tweaked the test case of issue 2163 to send a message with a single frame.

No assertion, the program completes.

There seems to be something broken in PUB -> SUB, multi-part messages, 
and zmq_disconnect.






Re: [zeromq-dev] about Assertion failed: !more (src/fq.cpp:117): just remove the assertion?

2016-10-20 Thread zmqdev

On 20.10.2016 14:31, Doron Somech wrote:

Also I think it smells like using socket from multiple threads...


Unfortunately not: the assertion fires in a single-threaded application.

See also the test case at

https://github.com/zeromq/libzmq/issues/2163


On Thu, Oct 20, 2016 at 3:28 PM, Doron Somech wrote:

I don't think it is correct; there might be an issue we need to solve.

The workaround is not to use multi-part messages...



I like multi-part messages!

The workaround would require me to do the framing myself, which would be 
tricky, considering the advanced state of the application.



Thanks anyway.



[zeromq-dev] about Assertion failed: !more (src/fq.cpp:117): just remove the assertion?

2016-10-20 Thread zmqdev

Hi,

Ranjeet Kumar described his troubles with the assertion in

http://lists.zeromq.org/pipermail/zeromq-dev/2016-September/030839.html

According to

http://lists.zeromq.org/pipermail/zeromq-dev/2016-September/030851.html

he apparently solved his problem by commenting the assertion out.

1. Could someone who understands that part of the code advise on the 
safety of removing that assertion?


2. Is the assertion even correct?

Thanks


Re: [zeromq-dev] BUG: SUB socket + multi-part message + disconnect + recv = Assertion failed at src/fq.cpp:117

2016-10-07 Thread zmqdev

On 07.10.2016 10:14, Laughing wrote:

*>>>* Does that mean that socket.disconnect does not disconnect from
all endpoints connected before?


see http://api.zeromq.org/4-1:zmq-disconnect

int zmq_disconnect (void *socket, const char *endpoint);

zmq_disconnect disconnects the socket from the given endpoint *only*.


*>>>* That is bad news. I would like to use the disconnect
routine to discard messages in 'REQ/REP' and 'PUSH/PULL' mode.


A ZeroMQ socket holds received messages in a queue.

To empty the queue, you can

a. receive and discard the messages until zmq_poll tells
   you there is no data

b. destroy, then recreate & reconnect the socket

I do not know the internal details, for example whether a socket has a 
queue per connection.


The bug seems to point towards some data management problem between 
queues in the socket.





Re: [zeromq-dev] BUG: SUB socket + multi-part message + disconnect + recv = Assertion failed at src/fq.cpp:117

2016-10-07 Thread zmqdev

On 07.10.2016 08:04, Laughing wrote:


I think that the socket cannot recv message any more after disconnect.


not quite correct: the SUB socket could still be connected to several 
other PUB sockets.


The abort is still present when you modify the test case accordingly.


The first frame is also an unexpected message.


it is not unexpected, as zmq_poll indicates that the SUB socket has data.


