Hi,
We are using zeromq (3.2.0) on linux Redhat 6.4-6.7 and widnows 2008-2012
X64. We are
using the EPGM PUB/SUB model. We use one master node(linux) and two slave
node(linux + windows) and one monitor node(windows).
On each node,we use three same pub ports,and each sub the same multicast
ips and ports.two pub ports for business process and one pub port for heart
beat. If the monitor find something wrong(heart beat not receive in 10
seconds) in one node,it will issue instruction to control the destination node
to close the heart beat pub port and reopen it.
From time to time, we have all of our PUB threads eating 100% of each
core. We don't know yet what exactly triggers this phenomenon and therefore
we can't reproduce it.
Here are some callstacks of the three PUB threads.
Thread 1 (process 42443):
#0 0x00007f70b2986b55 in pgm_rate_remaining2 () from libzmq.so.3
#1 0x00007f70b296d43a in pgm_getsockopt () from libzmq.so.3
#2 0x00007f70b2934b68 in zmq::pgm_socket_t::get_tx_timeout() () from
libzmq.so.3
#3 0x00007f70b29333ee in zmq::pgm_sender_t::out_event() () from libzmq.so.3
#4 0x00007f70b2938cac in zmq::poller_base_t::execute_timers() () from
libzmq.so.3
#5 0x00007f70b2926abb in zmq::epoll_t::loop() () from libzmq.so.3
#6 0x00007f70b294747c in thread_routine () from libzmq.so.3
#7 0x0000003bbb407851 in start_thread () from /lib64/libpthread.so.0
#8 0x0000003bbb0e890d in clone () from /lib64/libc.so.6
Thread 1 (process 42443):
#0 0x0000003bbb07a8f3 in malloc () from /lib64/libc.so.6
#1 0x0000003bc94bd09d in operator new(unsigned long) () from
/usr/lib64/libstdc++.so.6
#2 0x00007f70b29390a4 in std::_Rb_tree<unsigned long, std::pair<unsigned long
const, zmq::poller_base_t::timer_info_t>, std::_Select1st<std::pair<unsigned
long const, zmq::poller_base_t::timer_info_t> >, std::less<unsigned long>,
std::allocator<std::pair<unsigned long const, zmq::poller_base_t::timer_info_t>
> >::_M_insert(std::_Rb_tree_node_base*, std::_Rb_tree_node_base*,
std::pair<unsigned long const, zmq::poller_base_t::timer_info_t> const&) ()
from libzmq.so.3
#3 0x00007f70b2939165 in std::_Rb_tree<unsigned long, std::pair<unsigned long
const, zmq::poller_base_t::timer_info_t>, std::_Select1st<std::pair<unsigned
long const, zmq::poller_base_t::timer_info_t> >, std::less<unsigned long>,
std::allocator<std::pair<unsigned long const, zmq::poller_base_t::timer_info_t>
> >::insert_equal(std::pair<unsigned long const,
zmq::poller_base_t::timer_info_t> const&) () from libzmq.so.3
#4 0x00007f70b2938baf in zmq::poller_base_t::add_timer(int,
zmq::i_poll_events*, int) () from libzmq.so.3
#5 0x00007f70b29333fd in zmq::pgm_sender_t::out_event() () from libzmq.so.3
#6 0x00007f70b2938cac in zmq::poller_base_t::execute_timers() () from
libzmq.so.3
#7 0x00007f70b2926abb in zmq::epoll_t::loop() () from libzmq.so.3
#8 0x00007f70b294747c in thread_routine () from libzmq.so.3
#9 0x0000003bbb407851 in start_thread () from /lib64/libpthread.so.0
#10 0x0000003bbb0e890d in clone () from /lib64/libc.so.6
Thread 2 (process 42448):
#0 0x00007f70b2920a82 in zmq::clock_t::rdtsc() () from libzmq.so.3
#1 0x00007f70b2920b69 in zmq::clock_t::now_ms() () from libzmq.so.3
#2 0x00007f70b2938b90 in zmq::poller_base_t::add_timer(int,
zmq::i_poll_events*, int) () from libzmq.so.3
#3 0x00007f70b29333fd in zmq::pgm_sender_t::out_event() () from libzmq.so.3
#4 0x00007f70b2938cac in zmq::poller_base_t::execute_timers() () from
libzmq.so.3
#5 0x00007f70b2926abb in zmq::epoll_t::loop() () from libzmq.so.3
#6 0x00007f70b294747c in thread_routine () from libzmq.so.3
#7 0x0000003bbb407851 in start_thread () from /lib64/libpthread.so.0
#8 0x0000003bbb0e890d in clone () from /lib64/libc.so.6
Thread 2 (process 42448):
#0 0x00007f70b297820b in pgm_send () from libzmq.so.3
#1 0x00007f70b293532f in zmq::pgm_socket_t::send(unsigned char*, unsigned
long) () from libzmq.so.3
#2 0x00007f70b293334e in zmq::pgm_sender_t::out_event() () from libzmq.so.3
#3 0x00007f70b2938cac in zmq::poller_base_t::execute_timers() () from
libzmq.so.3
#4 0x00007f70b2926abb in zmq::epoll_t::loop() () from libzmq.so.3
#5 0x00007f70b294747c in thread_routine () from libzmq.so.3
#6 0x0000003bbb407851 in start_thread () from /lib64/libpthread.so.0
#7 0x0000003bbb0e890d in clone () from /lib64/libc.so.6
Thread 2 (process 42448):
#0 0x00007f70b296bf87 in pgm_rwlock_reader_unlock () from libzmq.so.3
#1 0x00007f70b296de98 in pgm_getsockopt () from libzmq.so.3
#2 0x00007f70b2934b68 in zmq::pgm_socket_t::get_tx_timeout() () from
libzmq.so.3
#3 0x00007f70b29333ee in zmq::pgm_sender_t::out_event() () from libzmq.so.3
#4 0x00007f70b2938cac in zmq::poller_base_t::execute_timers() () from
libzmq.so.3
#5 0x00007f70b2926abb in zmq::epoll_t::loop() () from libzmq.so.3
#6 0x00007f70b294747c in thread_routine () from libzmq.so.3
#7 0x0000003bbb407851 in start_thread () from /lib64/libpthread.so.0
#8 0x0000003bbb0e890d in clone () from /lib64/libc.so.6
Thread 2 (process 42448):
#0 0x0000003bbb40a659 in pthread_mutex_unlock () from /lib64/libpthread.so.0
#1 0x00007f70b297560d in pgm_mutex_unlock () from libzmq.so.3
#2 0x00007f70b29783e6 in pgm_send () from libzmq.so.3
#3 0x00007f70b293532f in zmq::pgm_socket_t::send(unsigned char*, unsigned
long) () from libzmq.so.3
#4 0x00007f70b293334e in zmq::pgm_sender_t::out_event() () from libzmq.so.3
#5 0x00007f70b2938cac in zmq::poller_base_t::execute_timers() () from
libzmq.so.3
#6 0x00007f70b2926abb in zmq::epoll_t::loop() () from libzmq.so.3
#7 0x00007f70b294747c in thread_routine () from libzmq.so.3
#8 0x0000003bbb407851 in start_thread () from /lib64/libpthread.so.0
#9 0x0000003bbb0e890d in clone () from /lib64/libc.so.6
Thread 2 (process 42448):
#0 0x00007f70b2972cae in pgm_atomic_inc32 () from libzmq.so.3
#1 0x00007f70b297846e in pgm_rwspinlock_reader_trylock () from libzmq.so.3
#2 0x00007f70b2978450 in pgm_rwlock_reader_trylock () from libzmq.so.3
#3 0x00007f70b29782d1 in pgm_send () from libzmq.so.3
#4 0x00007f70b293532f in zmq::pgm_socket_t::send(unsigned char*, unsigned
long) () from libzmq.so.3
#5 0x00007f70b293334e in zmq::pgm_sender_t::out_event() () from libzmq.so.3
#6 0x00007f70b2938cac in zmq::poller_base_t::execute_timers() () from
libzmq.so.3
#7 0x00007f70b2926abb in zmq::epoll_t::loop() () from libzmq.so.3
#8 0x00007f70b294747c in thread_routine () from libzmq.so.3
#9 0x0000003bbb407851 in start_thread () from /lib64/libpthread.so.0
#10 0x0000003bbb0e890d in clone () from /lib64/libc.so.6
Thread 3 (process 27985):
#0 0x00007f70b296d4a8 in pgm_getsockopt () from libzmq.so.3
#1 0x00007f70b2934b68 in zmq::pgm_socket_t::get_tx_timeout() () from
libzmq.so.3
#2 0x00007f70b29333ee in zmq::pgm_sender_t::out_event() () from libzmq.so.3
#3 0x00007f70b2938cac in zmq::poller_base_t::execute_timers() () from
libzmq.so.3
#4 0x00007f70b2926abb in zmq::epoll_t::loop() () from libzmq.so.3
#5 0x00007f70b294747c in thread_routine () from libzmq.so.3
#6 0x0000003bbb407851 in start_thread () from /lib64/libpthread.so.0
#7 0x0000003bbb0e890d in clone () from /lib64/libc.so.6
Thread 3 (process 27985):
#0 0x0000003bbb07a8ad in malloc () from /lib64/libc.so.6
#1 0x0000003bc94bd09d in operator new(unsigned long) () from
/usr/lib64/libstdc++.so.6
#2 0x00007f70b29390a4 in std::_Rb_tree<unsigned long, std::pair<unsigned long
const, zmq::poller_base_t::timer_info_t>, std::_Select1st<std::pair<unsigned
long const, zmq::poller_base_t::timer_info_t> >, std::less<unsigned long>,
std::allocator<std::pair<unsigned long const, zmq::poller_base_t::timer_info_t>
> >::_M_insert(std::_Rb_tree_node_base*, std::_Rb_tree_node_base*,
std::pair<unsigned long const, zmq::poller_base_t::timer_info_t> const&) ()
from libzmq.so.3
#3 0x00007f70b2939165 in std::_Rb_tree<unsigned long, std::pair<unsigned long
const, zmq::poller_base_t::timer_info_t>, std::_Select1st<std::pair<unsigned
long const, zmq::poller_base_t::timer_info_t> >, std::less<unsigned long>,
std::allocator<std::pair<unsigned long const, zmq::poller_base_t::timer_info_t>
> >::insert_equal(std::pair<unsigned long const,
zmq::poller_base_t::timer_info_t> const&) () from libzmq.so.3
#4 0x00007f70b2938baf in zmq::poller_base_t::add_timer(int,
zmq::i_poll_events*, int) () from libzmq.so.3
#5 0x00007f70b29333fd in zmq::pgm_sender_t::out_event() () from libzmq.so.3
#6 0x00007f70b2938cac in zmq::poller_base_t::execute_timers() () from
libzmq.so.3
#7 0x00007f70b2926abb in zmq::epoll_t::loop() () from libzmq.so.3
#8 0x00007f70b294747c in thread_routine () from libzmq.so.3
#9 0x0000003bbb407851 in start_thread () from /lib64/libpthread.so.0
#10 0x0000003bbb0e890d in clone () from /lib64/libc.so.6
Thread 3 (process 27985):
#0 0x00007f70b297676f in send_odata_copy () from libzmq.so.3
#1 0x00007f70b29783d6 in pgm_send () from libzmq.so.3
#2 0x00007f70b293532f in zmq::pgm_socket_t::send(unsigned char*, unsigned
long) () from libzmq.so.3
#3 0x00007f70b293334e in zmq::pgm_sender_t::out_event() () from libzmq.so.3
#4 0x00007f70b2938cac in zmq::poller_base_t::execute_timers() () from
libzmq.so.3
#5 0x00007f70b2926abb in zmq::epoll_t::loop() () from libzmq.so.3
#6 0x00007f70b294747c in thread_routine () from libzmq.so.3
#7 0x0000003bbb407851 in start_thread () from /lib64/libpthread.so.0
#8 0x0000003bbb0e890d in clone () from /lib64/libc.so.6
Thread 3 (process 27985):
#0 0x00007f70b2958313 in pgm_tsc_update () from libzmq.so.3
#1 0x00007f70b2986b04 in pgm_rate_remaining2 () from libzmq.so.3
#2 0x00007f70b296d43a in pgm_getsockopt () from libzmq.so.3
#3 0x00007f70b2934b68 in zmq::pgm_socket_t::get_tx_timeout() () from
libzmq.so.3
#4 0x00007f70b29333ee in zmq::pgm_sender_t::out_event() () from libzmq.so.3
#5 0x00007f70b2938cac in zmq::poller_base_t::execute_timers() () from
libzmq.so.3
#6 0x00007f70b2926abb in zmq::epoll_t::loop() () from libzmq.so.3
#7 0x00007f70b294747c in thread_routine () from libzmq.so.3
#8 0x0000003bbb407851 in start_thread () from /lib64/libpthread.so.0
#9 0x0000003bbb0e890d in clone () from /lib64/libc.so.6
we got the recent zeromq source code from github and found some bug fixes might
be relate to the phenomenon.
version: 3de116336907323822266aed603a18633e315228
author: Claudio Freire <[email protected]>
date: 2014/11/11 0:22:54
info:Fix busy-polling bug in pgm_sender causing it to consume CPU cycles
modify: src/pgm_sender.cpp
version: 92bee5a6dbb150115058c494e83702097aa58747
author: Ricardo Catalinas Jiménez <[email protected]>
date: 2012/7/28 17:04:03
info:Fix LIBZMQ-404: zmq_term not truly re-entrant
zmq_term can not safely be re-entered with pgm transport.
Fix proposed by Steven McCoy.
modify: src/zmq.cpp
If our monitor node found something wrong., it may issue instruction to
control the destination node to close the heart beat pub port and reopen it,and
the destination node use zmq_term to close the heart beat pub port.
But we are not sure if these bug fixes can solve our problem and we don't
know how to reproduce the phenomenon of these bug fixes.
Thanks.
dongfang zhang_______________________________________________
zeromq-dev mailing list
[email protected]
https://lists.zeromq.org/mailman/listinfo/zeromq-dev